The four-volume set LNCS 11334-11337 constitutes the proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2018, held in Guangzhou, China, in November 2018. The 141 full and 50 short papers presented were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on Distributed and Parallel Computing; High Performance Computing; Big Data and Information Processing; Internet of Things and Cloud Computing; and Security and Privacy in Computing.


LNCS 11336

Jaideep Vaidya Jin Li (Eds.)

Algorithms and Architectures for Parallel Processing
18th International Conference, ICA3PP 2018
Guangzhou, China, November 15–17, 2018
Proceedings, Part III


Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany


More information about this series at http://www.springer.com/series/7407


Editors Jaideep Vaidya Rutgers University Newark, NJ, USA

Jin Li Guangzhou University Guangzhou, China

ISSN 0302-9743, ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-05056-6, ISBN 978-3-030-05057-3 (eBook)
https://doi.org/10.1007/978-3-030-05057-3
Library of Congress Control Number: 2018962485
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Welcome to the proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2018), which was organized by Guangzhou University and held in Guangzhou, China, during November 15–17, 2018. ICA3PP 2018 was the 18th event in a series of conferences devoted to research on algorithms and architectures for parallel processing. Previous iterations of the conference include ICA3PP 2017 (Helsinki, Finland, November 2017), ICA3PP 2016 (Granada, Spain, December 2016), ICA3PP 2015 (Zhangjiajie, China, November 2015), ICA3PP 2014 (Dalian, China, August 2014), ICA3PP 2013 (Vietri sul Mare, Italy, December 2013), ICA3PP 2012 (Fukuoka, Japan, September 2012), ICA3PP 2011 (Melbourne, Australia, October 2011), ICA3PP 2010 (Busan, Korea, May 2010), ICA3PP 2009 (Taipei, Taiwan, June 2009), ICA3PP 2008 (Cyprus, June 2008), ICA3PP 2007 (Hangzhou, China, June 2007), ICA3PP 2005 (Melbourne, Australia, October 2005), ICA3PP 2002 (Beijing, China, October 2002), ICA3PP 2000 (Hong Kong, China, December 2000), ICA3PP 1997 (Melbourne, Australia, December 1997), ICA3PP 1996 (Singapore, June 1996), and ICA3PP 1995 (Brisbane, Australia, April 1995). ICA3PP is now recognized as the main regular event in the area of parallel algorithms and architectures, covering many dimensions including fundamental theoretical approaches, practical experimental projects, and commercial and industry applications. The conference provides a forum for academics and practitioners from countries and regions around the world to exchange ideas for improving the efficiency, performance, reliability, security, and interoperability of computing systems and applications. ICA3PP 2018 attracted over 400 high-quality research papers highlighting the foundational work that strives to push beyond the limits of existing technologies, including experimental efforts, innovative systems, and investigations that identify weaknesses in existing parallel processing technology.
Each submission was reviewed by at least two experts in the relevant areas, on the basis of its significance, novelty, technical quality, presentation, and practical impact. Based on the review results, 141 full papers were selected for presentation at the conference, giving an acceptance rate of 35%. In addition, we accepted 50 short papers and 24 workshop papers. Besides the paper presentations, the conference program included four keynote speeches and two invited talks from esteemed scholars in the area, namely: Prof. Xuemin (Sherman) Shen, University of Waterloo, Canada; Prof. Wenjing Lou, Virginia Tech, USA; Prof. Witold Pedrycz, University of Alberta, Canada; Prof. Xiaohua Jia, City University of Hong Kong, Hong Kong; Prof. Xiaofeng Chen, Xidian University, China; and Prof. Xinyi Huang, Fujian Normal University, China. We were extremely honored to have them as the conference keynote and invited speakers. ICA3PP 2018 was made possible by the behind-the-scenes efforts of selfless individuals and organizations who volunteered their time and energy to ensure the success


of this conference. We would like to express our special appreciation to Prof. Yang Xiang, Prof. Weijia Jia, Prof. Yi Pan, Prof. Laurence T. Yang, and Prof. Wanlei Zhou, the Steering Committee members, for giving us the opportunity to host this prestigious conference and for their guidance with the conference organization. We would like to emphasize our gratitude to the general chairs, Prof. Albert Zomaya and Prof. Minyi Guo, for their outstanding support in organizing the event. Thanks also to the publicity chairs, Prof. Zheli Liu and Dr. Weizhi Meng, for their great job in publicizing this event. We would like to thank all the members of the Organizing Committee and Program Committee for their efforts and support. The ICA3PP 2018 program included two workshops, namely, the ICA3PP 2018 Workshop on Intelligent Algorithms for Large-Scale Complex Optimization Problems and the ICA3PP 2018 Workshop on Security and Privacy in Data Processing. We would like to express our sincere appreciation to the workshop chairs: Prof. Ting Hu, Prof. Feng Wang, Prof. Hongwei Li, and Prof. Qian Wang. Last but not least, we would like to thank all the contributing authors and all conference attendees, as well as the great team at Springer that assisted in producing the conference proceedings, and the developers and maintainers of EasyChair.

November 2018

Jaideep Vaidya
Jin Li

Organization

General Chairs
Albert Zomaya, University of Sydney, Australia
Minyi Guo, Shanghai Jiao Tong University, China

Program Chairs
Jaideep Vaidya, Rutgers University, USA
Jin Li, Guangzhou University, China

Publication Chair
Yu Wang, Guangzhou University, China

Publicity Chairs
Zheli Liu, Nankai University, China
Weizhi Meng, Technical University of Denmark, Denmark

Steering Committee
Yang Xiang (Chair), Swinburne University of Technology, Australia
Weijia Jia, Shanghai Jiao Tong University, China
Yi Pan, Georgia State University, USA
Laurence T. Yang, St. Francis Xavier University, Canada
Wanlei Zhou, Deakin University, Australia

Program Committee
Pedro Alonso, Universitat Politècnica de València, Spain
Daniel Andresen, Kansas State University, USA
Cosimo Anglano, Universitá del Piemonte Orientale, Italy
Danilo Ardagna, Politecnico di Milano, Italy
Kapil Arya, Northeastern University, USA
Marcos Assuncao, Inria, France
Joonsang Baek, University of Wollongong, Australia
Anirban Basu, KDDI Research Inc., Japan
Ladjel Bellatreche, LIAS/ENSMA, France
Jorge Bernal Bernabe, University of Murcia, Spain
Thomas Boenisch, High-Performance Computing Center Stuttgart, Germany


George Bosilca, University of Tennessee, USA
Massimo Cafaro, University of Salento, Italy
Philip Carns, Argonne National Laboratory, USA
Alexandra Carpen-Amarie, Vienna University of Technology, Austria
Aparicio Carranza, City University of New York, USA
Aniello Castiglione, University of Salerno, Italy
Arcangelo Castiglione, University of Salerno, Italy
Pedro Castillo, University of Granada, Spain
Tzung-Shi Chen, National University of Tainan, Taiwan
Kim-Kwang Raymond Choo, The University of Texas at San Antonio, USA
Mauro Conti, University of Padua, Italy
Jose Alfredo Ferreira Costa, Federal University, UFRN, Brazil
Raphaël Couturier, University Bourgogne Franche-Comté, France
Miguel Cárdenas Montes, CIEMAT, Spain
Masoud Daneshtalab, Mälardalen University and Royal Institute of Technology, Sweden
Casimer Decusatis, Marist College, USA
Eugen Dedu, University of Bourgogne Franche-Comté, France
Juan-Carlos Díaz-Martín, University of Extremadura, Spain
Matthieu Dorier, Argonne National Laboratory, USA
Avgoustinos Filippoupolitis, University of Greenwich, UK
Ugo Fiore, Federico II University, Italy
Franco Frattolillo, University of Sannio, Italy
Marc Frincu, West University of Timisoara, Romania
Jorge G. Barbosa, University of Porto, Portugal
Chongzhi Gao, Guangzhou University, China
Jose Daniel García, University Carlos III of Madrid, Spain
Luis Javier García Villalba, Universidad Complutense de Madrid, Spain
Paolo Gasti, New York Institute of Technology, USA
Vladimir Getov, University of Westminster, UK
Olivier Gluck, Université de Lyon, France
Jing Gong, KTH Royal Institute of Technology, Sweden
Amina Guermouche, Telecom Sud-Paris, France
Jeff Hammond, Intel, USA
Feng Hao, Newcastle University, UK
Houcine Hassan, Universitat Politècnica de València, Spain
Sun-Yuan Hsieh, National Cheng Kung University, Taiwan
Chengyu Hu, Shandong University, China
Xinyi Huang, Fujian Normal University, China
Mauro Iacono, University of Campania Luigi Vanvitelli, Italy
Shadi Ibrahim, Inria, France
Yasuaki Ito, Hiroshima University, Japan
Mathias Jacquelin, Lawrence Berkeley National Laboratory, USA
Nan Jiang, East China Jiaotong University, China
Lu Jiaxin, Jiangxi Normal University, China

Edward Jung, Kennesaw State University, USA
Georgios Kambourakis, University of the Aegean, Greece
Gabor Kecskemeti, Liverpool John Moores University, UK
Muhammad Khurram Khan, King Saud University, Saudi Arabia
Dieter Kranzlmüller, Ludwig Maximilian University of Munich, Germany
Michael Kuhn, University of Hamburg, Germany
Julian Kunkel, German Climate Computing Center, Germany
Algirdas Lančinskas, Vilnius University, Lithuania
Patrick P. C. Lee, The Chinese University of Hong Kong, SAR China
Laurent Lefevre, Inria, France
Hui Li, University of Electronic Science and Technology of China, China
Kenli Li, Hunan University, China
Dan Liao, University of Electronic Science and Technology of China, China
Jingyu Liu, Hebei University of Technology, China
Joseph Liu, Monash University, Australia
Yunan Liu, Jiangxi Normal University, China
Zheli Liu, Nankai University, China
Jay Lofstead, Sandia National Laboratories, USA
Paul Lu, University of Alberta, Canada
Amit Majumdar, University of California San Diego, USA
Tomas Margalef, Universitat Autonoma de Barcelona, Spain
Stefano Markidis, KTH Royal Institute of Technology, Sweden
Alejandro Masrur, Chemnitz University of Technology, Germany
Susumu Matsumae, Saga University, Japan
Raffaele Montella, University of Naples Parthenope, Italy
Francesco Moscato, University of Campania Luigi Vanvitelli, Italy
Bogdan Nicolae, Argonne National Laboratory, USA
Francesco Palmieri, University of Salerno, Italy
Swann Perarnau, Argonne National Laboratory, USA
Dana Petcu, West University of Timisoara, Romania
Salvador Petit, Universitat Politècnica de València, Spain
Riccardo Petrolo, Rice University, USA
Florin Pop, University Politehnica of Bucharest, Romania
Radu Prodan, University of Klagenfurt, Austria
Zhang Qikun, Beijing Institute of Technology, China
Thomas Rauber, University Bayreuth, Germany
Khaled Riad, Zagazig University, Egypt
Suzanne Rivoire, Sonoma State University, USA
Ivan Rodero, Rutgers University, USA
Romain Rouvoy, University of Lille, France
Antonio Ruiz-Martínez, University of Murcia, Spain
Françoise Sailhan, CNAM, France
Sherif Sakr, The University of New South Wales, Australia
Giandomenico Spezzano, ICAR-CNR and University of Calabria, Italy

Patricia Stolf, IRIT, France
John Stone, University of Illinois at Urbana-Champaign, USA
Peter Strazdins, The Australian National University, Australia
Hari Subramoni, The Ohio State University, USA
Gang Sun, University of Science and Technology of China, China
Zhizhuo Sun, Beijing Institute of Technology, China
Frederic Suter, CNRS, France
Yu-An Tan, Beijing Institute of Technology, China
Ming Tao, Dongguan University of Technology, China
Andrei Tchernykh, CICESE Research Center, Mexico
Massimo Torquati, University of Pisa, Italy
Tomoaki Tsumura, Nagoya Institute of Technology, Japan
Didem Unat, Koç University, Turkey
Vladimir Voevodin, Moscow University, Russia
Feng Wang, Wuhan University, China
Hao Wang, Shandong Normal University, China
Yu Wei, Nankai University, China
Sheng Wen, Swinburne University of Technology, Australia
Jigang Wu, Guangdong University of Technology, China
Roman Wyrzykowski, Czestochowa University of Technology, Poland
Yu Xiao, Shandong University of Technology, China
Ramin Yahyapour, University of Göttingen, Germany
Fang Yan, Beijing Wuzi University, China
Zheng Yan, Xidian University, China
Laurence T. Yang, St. Francis Xavier University, Canada
Wun-She Yap, Universiti Tunku Abdul Rahman, Malaysia

Contents – Part III

Big Data and Information Processing

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation in Cooperative Cognitive Radio Networks
  Xinxiang Zhang, Jigang Wu, and Long Chen

QoS-Driven Service Matching Algorithm Based on User Requirements
  Mengying Guo and Xudong Yang

Research on Overload Classification Method for Bus Images Based on Image Processing and SVM
  Tingting Li, Yongxiong Sun, Yanhua Liang, Yujia Zhai, and Xuan Ji

Accurate Acoustic Based Gesture Classification with Zero Start-Up Cost
  Haojun Ai, Liangliang Han, Yifeng Wang, and Liang Liao

An Approach of Collecting Performance Anomaly Dataset for NFV Infrastructure
  Qingfeng Du, Yu He, Tiandi Xie, Kanglin Yin, and Juan Qiu

An Axiomatization for BSP Algorithms
  Yoann Marquer and Frédéric Gava

Efficient and Secure Outsourced Linear Regression
  Haomiao Yang, Weichao He, Qixian Zhou, and Hongwei Li

New Multi-objectives Scheduling Strategies in Docker SwarmKit
  Tarek Menouer, Christophe Cérin, and Étienne Leclercq

Internet Performance Prediction Framework Based on PingER Dataset
  Wei Zhang, Xiaofei Xing, Saqib Ali, and Guojun Wang

MS-RAID: An Energy-Saving Data Layout for CDP
  Jingyu Liu, Ziyao Zhang, Lu Liu, and Xin Chai

Incentivizing Multimedia Data Acquisition for Machine Learning System
  Yiren Gu, Hang Shen, Guangwei Bai, Tianjing Wang, Hai Tong, and Yujia Hu

Toward Performance Prediction for Multi-BSP Programs in ML
  Victor Allombert, Frédéric Gava, and Julien Tesson

Exploiting the Table of Energy and Power Leverages
  Issam Raïs, Laurent Lefèvre, Anne-Cécile Orgerie, and Anne Benoit

A Semantic Web Based Intelligent IoT Model
  Chao Qu, Ming Tao, Jie Zhang, Xiaoyu Hong, and Ruifen Yuan

Accelerating CNNs Using Optimized Scheduling Strategy
  Rui Xu, Sheng Ma, Wenwu Li, and Yang Guo

Data Analysis of Blended Learning in Python Programming
  Qian Chu, Xiaomei Yu, Yuli Jiang, and Hong Wang

APs Deployment Optimization for Indoor Fingerprint Positioning with Adaptive Particle Swarm Algorithm
  Jianhui Zhao, Jun Li, Haojun Ai, and Bo Cai

Deployment Optimization of Indoor Positioning Signal Sources with Fireworks Algorithm
  Jianhui Zhao, Shiqi Wen, Haojun Ai, and Bo Cai

A Study of Sleep Stages Threshold Based on Multiscale Fuzzy Entropy
  Xuexiao Shao, Bin Hu, Yalin Li, and Xiangwei Zheng

Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels
  Jing Liu, Kun Han, Wenhua Wu, Shu Wang, and Xiao Yu

Facial Shape and Expression Transfer via Non-rigid Image Deformation
  Huabing Zhou, Shiqiang Ren, Yong Zhou, Yuyu Kuang, Yanduo Zhang, Wei Zhang, Tao Lu, Hanwen Chen, and Deng Chen

P-Schedule: Erasure Coding Schedule Strategy in Big Data Storage System
  Chao Yin, Haitao Lv, Tongfang Li, Yan Liu, Xiaoping Qu, and Sihao Yuan

Answer Aggregation of Crowdsourcing Employing an Improved EM-Based Approach
  Ran Zhang, Lei Liu, Lizhen Cui, Wei He, and Hui Li

Internet of Things and Cloud Computing

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data Using Apache Spark in Cloud
  Cheng Yang, Weidong Bao, Xiaomin Zhu, Ji Wang, and Wenhua Xiao

Task Offloading in Edge-Clouds with Budget Constraint
  Lei He, Hongli Xu, Haibo Wang, Liusheng Huang, and Jingyi Ma

Motion Trajectory Sequence-Based Map Matching Assisted Indoor Autonomous Mobile Robot Positioning
  Wenping Yu, Jianzhong Zhang, Jingdong Xu, and Yuwei Xu

Towards the Independent Spanning Trees in the Line Graphs of Interconnection Networks
  Baolei Cheng, Jianxi Fan, Xiaoyan Li, Guijuan Wang, Jingya Zhou, and Yuejuan Han

POEM: Pricing Longer for Edge Computing in the Device Cloud
  Qiankun Yu, Jigang Wu, and Long Chen

Mobility Analysis and Response for Software-Defined Internet of Things
  Zhiyong Zhang, Rui Wang, Xiaojun Cai, and Zhiping Jia

DStore: A Distributed Cloud Storage System Based on Smart Contracts and Blockchain
  Jingting Xue, Chunxiang Xu, Yuan Zhang, and Lanhua Bai

Towards an Efficient and Real-Time Scheduling Platform for Mobile Charging Vehicles
  Qi Liu, Jinyang Li, Xiaoshan Sun, Junjie Wang, Yang Ning, Wei Zheng, Jian Li, and Hengchang Liu

SoProtector: Securing Native C/C++ Libraries for Mobile Applications
  Ning Zhang, Guangquan Xu, Guozhu Meng, and Xi Zheng

CloudPT: Performance Testing for Identifying and Detecting Bottlenecks in IaaS
  Ameen Alkasem, Hongwei Liu, and Decheng Zuo

Smart Grid Power Trading Based on Consortium Blockchain in Internet of Things
  Dong Zheng, Kaixin Deng, Yinghui Zhang, Jiangfan Zhao, Xiaokun Zheng, and Xinwei Ma

Energy-Efficient Offloading in Mobile Edge Computing with Edge-Cloud Collaboration
  Xin Long, Jigang Wu, and Long Chen

Quantitatively Investigating Multihop Localization Errors in Regular 2-D Sensor Networks
  Bing Jia, Baoqi Huang, Tao Zhou, and Wuyungerile Li

Optimizing WiFi AP Placement for Both Localization and Coverage
  Yu Tian, Baoqi Huang, Bing Jia, and Long Zhao

PLZMA: A Parallel Data Compression Method for Cloud Computing
  Xin Wang, Lin Gan, Jingheng Xu, Jinzhe Yang, Maocai Xia, Haohuan Fu, Xiaomeng Huang, and Guangwen Yang

A Caching-Based Parallel FP-Growth in Apache Spark
  Zhicheng Cai, Xingyu Zhu, Yuehui Zheng, Duan Liu, and Lei Xu

Contextual-Field Supported Iterative Representation for Face Hallucination
  Kangli Zeng, Tao Lu, Xiaolin Li, Yanduo Zhang, Li Peng, and Shenming Qu

A Cancelable Multi-Biometric Template Generation Algorithm Based on Bloom Filter
  Lin You and Xun Li

Streaming ETL in Polystore Era
  Nabila Berkani and Ladjel Bellatreche

Communication-Aware Prediction-Based Online Scheduling in High-Performance Real-Time Embedded Systems
  Baptiste Goupille-Lescar, Eric Lenormand, Nikos Parlavantzas, and Christine Morin

Predicting SDC Vulnerability of Instructions Based on Random Forests Algorithm
  LiPing Liu, LinLin Ci, and Wei Liu

Hybrid Cloud Architecture for Cross-Platform Interoperability in Smart Homes
  Ming Tao, Chao Qu, Wenhong Wei, Bin Zhou, and Shuqiang Huang

Conflict-Free Block-with-Stride Access of 2D Storage Structure
  Rui Song, Guozhao Zeng, Sheng Liu, and Haiyan Chen

Graph-Based Indoor Localization with the Fusion of PDR and RFID Technologies
  Jie Wu, Minghua Zhu, Bo Xiao, and Yunzhou Qiu

UAV 3D Mobility Model Oriented to Dynamic and Uncertain Environment
  Na Wang, Nan Di, Fei Dai, and Fangxin Liu

Acquiring Hidden Space via Modifying Block Bitmap for Android Devices
  Wang Lianfang, Huang Hong, Li Yuanzhang, and Zhang Li

Interest Relevance-Based Caching Design in Content-Centric Networking
  Guozhi Zhang, Jiqiang Liu, Xiaolin Chang, and Yang Yang

Author Index

Big Data and Information Processing

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation in Cooperative Cognitive Radio Networks

Xinxiang Zhang, Jigang Wu, and Long Chen
Guangdong University of Technology, Guangzhou 510006, China

Abstract. Cooperative cognitive radio networks have been proposed to address the spectrum-starvation problem and enhance the transmission rate of mobile devices. Most existing works assume that a single user can afford the whole spectrum and neglect users' selfish nature, which is not practical. Based on group buying, a two-stage auction mechanism named TAMSA is proposed to guarantee quality of service and improve the utilization ratio of spectrum resources. TAMSA is an incentive mechanism involving the primary users (PUs) and relay nodes; it also reduces the cost of the secondary users (SUs) and increases the utilities of both PUs and relay nodes. In the first stage, SUs submit their budgets, valuations, and demands for spectrum resources to relay nodes in group buying, and the relay nodes calculate revenues and determine the winning SUs. In the second stage, we execute a VCG auction between the relay nodes and PUs, using a maximum-weighted-matching algorithm. TAMSA can effectively allocate spectrum resources to meet the demands of SUs. We show that TAMSA is truthful, individually rational, and computationally efficient. Extensive simulation results show that TAMSA outperforms a random algorithm by 256% in terms of the average utility of PUs, and improves the average utility of SUs and relay nodes significantly, by up to 213% and 10 times, respectively. TAMSA further improves the average utility of PUs by 28.33% and 78.65% over TASG and TACC, respectively.

Keywords: Spectrum allocation · VCG auction · Incentive mechanism · Cooperative cognitive radio networks

1 Introduction

With the explosive growth of smart phones, wearable devices, and the Internet of Things (IoT), users are demanding higher data rates and lower latency. Spectrum is one of the most valuable resources for wireless communication devices. However, many spectrum resources have already been allocated to licensed users. On one hand, the remaining unused spectrum resources have become scarce. On the other hand, some allocated spectrum resources have not been fully utilized, such as

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 3–16, 2018. https://doi.org/10.1007/978-3-030-05057-3_1


radio and TV channels, resulting in spectrum holes [1–3]. Cognitive radio has been proposed to solve these problems, guaranteeing quality of service (QoS) for mobile devices and improving the utilization ratio of spectrum resources. To enhance the performance of cognitive radio networks (CRNs), cooperative cognitive radio networks (CCRNs) were proposed [4]. In CCRNs, there are two kinds of users: the spectrum holder, i.e., the primary (licensed) user, denoted as a PU, and the secondary (unlicensed) user, denoted as an SU [5]. Mobile devices with cognitive capability can dynamically detect and utilize idle spectrum resources. CCRNs allow SUs to access the licensed spectrum occupied by PUs to improve spectrum utilization [6,7], provided the SUs do not cause strong interference to the normal communication of the PUs. CCRNs can thus improve the utilization ratio of spectrum resources through spectrum reuse. Auctions play an important role in spectrum allocation, and there has been much research on auction-based spectrum allocation [8–10]. Most prior works design single-seller, multi-buyer auctions with homogeneous channels. The authors of [1] and [4] design truthful auctions for trading homogeneous channels between one seller and multiple SUs. In [5], a distributed resource allocation algorithm is adopted in which direct or cooperative transmission can be selected, with multiple sellers and multiple buyers. Many studies assume that PUs are willing to share their idle spectrum resources; in reality, PUs are usually selfish, so it is necessary to provide incentives for PUs to participate. The Vickrey-Clarke-Groves (VCG) auction guarantees the truthfulness of the auction process, which provides a new idea for resource allocation and can effectively guarantee the economic returns of the participants.

A McAfee-based auction mechanism has been proposed that considers cooperative transmission by relay nodes and maximizes the benefit of the PUs, but it does not consider the revenues of relay nodes [7]. In existing works [11–13], the authors propose VCG-based auction mechanisms that maximize the utility of PUs and guarantee truthfulness; however, their objective of maximizing the number of served PUs neglects the specific demands of SUs for spectrum resources. In recent years, double auctions [10] and combinatorial auctions [11] have also been considered for spectrum allocation, but most works neglect cooperative data transmission by relay nodes. Inspired by the popular group-buying services on the Internet, the authors of [13] and [14] propose auction algorithms based on group buying, which encourage SUs to voluntarily group together to acquire spectrum resources in spectrum auctions. Group buying can effectively reduce the payments of the SUs; however, in [12–14], the spectrum resources are distributed equally among the winning SUs. In [15], a multiple-input multiple-output method is proposed for CRNs with cooperative communication; it allows SUs to help transmit data for PUs and thereby obtain the opportunity to transmit their own data, but the mechanism imposes high hardware requirements. In this work, we reduce the payment of SUs with group buying, and we allocate spectrum resources according to the specific demands of the SUs.
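To make the truthfulness property of VCG-style pricing concrete, here is a minimal sealed-bid Vickrey (second-price) auction, the single-item special case of VCG. The bidder names and amounts are made up for illustration; this is not the TAMSA mechanism itself.

```python
# Minimal sketch of a sealed-bid Vickrey (second-price) auction.
# The winner is the highest bidder but pays only the second-highest
# bid, which makes truthful bidding a dominant strategy.

def vickrey_auction(bids):
    """bids: dict bidder -> bid amount. Returns (winner, payment)."""
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    payment = ranked[1][1]  # second-highest bid, independent of winner's own bid
    return winner, payment

# Hypothetical relay-node bids for a single channel.
winner, price = vickrey_auction({"R1": 9.0, "R2": 7.5, "R3": 4.0})
# R1 wins and pays 7.5, no matter how far above 7.5 it bid.
```

Because the payment does not depend on the winner's own bid, overbidding or underbidding cannot improve a bidder's utility, which is the intuition the VCG-based mechanisms above build on.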


To effectively allocate spectrum resources and encourage PUs to share them in the designed auction, we have to address the following challenges. (1) The applications running on mobile devices are heterogeneous, so the budget and demand of each SU are different; moreover, reducing the cost of the SUs is itself a challenge. (2) Spectrum holders and relay nodes are selfish by nature, so incentives must be designed for both PUs and relay nodes. (3) The auction should be truthful, budget balanced, individually rational, and computationally efficient, and the mechanism must ensure all of these properties. Different from previous works, we focus on an incentive auction mechanism for efficient spectrum resource allocation in CCRNs. TAMSA provides incentives for both PUs and relay nodes to participate in the auction. Moreover, TAMSA is based on group buying to reduce the payments of SUs, and it allocates spectrum resources according to the specific demands of the SUs. The main contributions of this work are summarized as follows.

• To effectively reduce the payment of SUs, we propose an auction algorithm based on group buying that is driven by the SUs' specific spectrum demands. The auction mechanism is applicable to heterogeneous networks, and its economic properties (truthfulness, budget balance, individual rationality, and computational efficiency) are proved.
• We design an incentive mechanism that encourages spectrum holders to share their idle spectrum resources and encourages relay nodes to transmit data cooperatively.
• Extensive numerical results demonstrate that TAMSA outperforms a random algorithm by 256% in terms of the average utility of PUs, and by 10 times and 213% in terms of the average utility of relay nodes and SUs, respectively. TAMSA further improves the average utility of PUs by 28.33% and 78.65% over TASG and TACC, respectively.
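The first stage groups SUs behind a relay node and determines the winners within the group. This excerpt does not spell out TAMSA's exact winner-determination rule, so the sketch below assumes a classic uniform-clearing-price group-buying rule: sort the bids in descending order and accept the largest prefix of bidders whose uniform price (the lowest accepted bid) still covers the channel's reserve price. The SU names, bids, and reserve price are invented for the example.

```python
# Stage-one group-buying winner determination (assumed rule, not the
# paper's exact algorithm): winners are the largest prefix of bidders,
# sorted by bid, such that charging everyone the lowest accepted bid
# still covers the channel's reserve (ask) price A_k.

def group_buying_winners(bids, reserve):
    """bids: dict SU -> bid; reserve: the channel's reserve price.

    Returns (list of winning SUs, uniform clearing price), or
    ([], None) if no prefix of bidders can jointly cover the reserve.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    for k in range(len(ranked), 0, -1):
        clearing = ranked[k - 1][1]        # lowest bid among the top k
        if k * clearing >= reserve:        # group revenue covers the ask
            return [su for su, _ in ranked[:k]], clearing
    return [], None

# Four SUs in one group bidding for a channel with reserve price 6.0.
winners, price = group_buying_winners(
    {"s1": 5.0, "s2": 4.0, "s3": 2.0, "s4": 1.0}, reserve=6.0)
# The top three SUs win and each pays the uniform price 2.0.
```

Charging every winner the same clearing price, rather than its own bid, is what lets group buying lower individual payments while the group as a whole still covers the seller's ask.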

2 System Model and Problem Formulation

In this section, we present the system model, formulate the problem to be studied, and introduce the related economic properties that an auction scheme should satisfy. The basic notations are shown in Table 1.

2.1 System Model

In this paper, we consider a cognitive network with multiple primary users and multiple secondary users. In order to improve the channel transmission rate, we also take relay nodes into account. In this scenario, as in [16], we assume all nodes stay static during a given auction period. The TAMSA scheme aims to maximize the social welfare of the spectrum auction, which also encourages both PUs and SUs to participate. To maximize the utilization of spectrum resources, the incentive mechanism should properly match the spectrum resources to the demands of the SUs. Trading between PUs and SUs should meet certain requirements to benefit both parties, so PUs need to be incentivized to provide resources, and the demands of the SUs should be satisfied.

Table 1. Notations for the system model.

  PUs        Set of primary users
  SUs        Set of secondary users
  R_i        The ith relay node, where i ∈ [1, M]
  S_i        The ith group, where i ∈ [1, n_i]
  s_i^j      The jth secondary user in the ith group, 1 ≤ i ≤ M, 1 ≤ j ≤ n_i
  d_i^j(k)   Demand of s_i^j for the kth channel (PU_k), 1 ≤ k ≤ M
  b_i^j(k)   The bid of s_i^j for the kth channel
  v_i^j(k)   The valuation of s_i^j for the kth channel
  A_k        Ask (reserve) price of the kth channel
  S_i^w      Set of winning secondary users, 1 ≤ w ≤ n_i
  R_i^w      Set of winning relay nodes
  PU_i^w     Set of winning primary users
  p_i^j(k)   The payment of s_i^j for the kth channel
  p_c(k)     The clearing price
  F_i(k)     S_i(k)'s payment to the kth relay node
  P_i(k)     The ith relay node R_i(k)'s payment to PU_k
  B_i(k)     The bid of the ith relay node R_i(k) for PU_k
  u_i^j      The utility of s_i^j
  U_PU_k     The utility of PU_k
  U_R_k      The utility of R_k

The proposed network model is shown in Fig. 1: a hierarchical auction consisting of m PUs and n_i SUs. The PUs possess M heterogeneous channels, and each primary user has a reserve price A_k, where k ∈ [1, M], which is the lowest price at which PU_i is willing to sell the kth channel. The PUs have different reserve prices A_k for their spectrum, and we assume each relay node can buy at most one spectrum band. In the ith group S_i, where i ∈ [1, M], there are n SUs, S_i = {s_i^1, s_i^2, · · · , s_i^n}, n ≤ n_i. Each s_i^j has a bid or budget b_i^j(k) and a valuation v_i^j(k) for the kth channel PU_k. In order to improve the utilization of spectrum resources, each s_i^j also submits its demand d_i^j(k) for spectrum to PU_k. The spectrum resource is allocated according to the specific demands of the SUs.


Fig. 1. Auction model.

We design an incentive mechanism to improve the utilities of the PUs and relay nodes. TAMSA is a two-stage hierarchical auction consisting of two single-round sealed-bid auctions, called the stage I auction and the stage II auction. In stage I, the auction is conducted between the relay nodes and the groups of secondary users S_i; in stage II, the auction is conducted between the PUs and the relay nodes R_i, and the PUs sell their spectrum resources to the relay nodes. The relay node R_i(k) gathers the bids and demands from the ith group S_i; then the system executes the stage II auction. R_i(k) submits the bid B_i(k) to PU_k, and PU_k gives the reserve price A_k, where k ∈ [1, M], noting that B_i(k) ≥ A_k. The relay node R_i(k) determines the winners in group S_i(k) after gathering the ith group members' bids; the set of winning SUs is denoted by S_i^w(k), where S_i^w(k) ⊆ S_i, and the gathered bid is F_i(k). We assume that each group pays at most one relay node at a time, because one relay node serving multiple groups might cause transmission delay. If it wins in this auction, relay node R_i will allocate spectrum resources to S_i^w(k).

2.2 Problem Formulation

The system will determine the payment of the winners. To achieve fairness, the payments of the winners should be proportional to the workloads of their demands. The payment of s_i^j(k) is formulated as

  p_i^j(k) = p_c(k) · d_i^j(k),  1 ≤ i ≤ M, 1 ≤ j ≤ n_i, 1 ≤ k ≤ M,   (1)


X. Zhang et al.

where p_c(k) is the clearing price. Let u_i^j denote the utility of secondary user s_i^j, for each s_i^j ∈ S_i^w. Accordingly, the utility of s_i^j is defined as

  u_i^j = { v_i^j(k) − p_i^j(k),  if s_i^j ∈ S_i^w and p_i^j(k) ≤ b_i^j(k)
          { 0,                    otherwise.                              (2)

Note that the payment of s_i^j(k) should not be higher than the budget b_i^j(k), k ∈ [1, M], s_i^j ∈ S_i^w. The relay node R_i(k) calculates the finance F_i(k) collected from the SUs. Hence the utility of relay node R_i is

  U_R_i = { F_i(k) − P_i(k),  if R_i(k) ∈ R_i^w
          { 0,                otherwise,                                  (3)

where P_i(k) is the payment of the relay node to the PUs. In order to encourage spectrum holders to share spectrum resources, each PU_k has a reserve price A_k. The payment P_i(k) of the relay nodes should be higher than the reserve price A_k, so the utility of PU_k is defined as

  U_PU_k = { P_i(k) − A_k,  if PU_k ∈ PU_k^w and R_i ∈ R_i^w
           { 0,             otherwise.                                    (4)

In this auction, the spectrum owners PU_k allocate spectrum resources to the SUs, and the relay nodes cooperatively increase the channel transmission speed.
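To make the utility definitions of Eqs. (2)–(4) concrete, here is a minimal Python sketch; the function names and the example numbers are ours, not from the paper:

```python
def su_utility(valuation, payment, is_winner, budget):
    """Utility of a secondary user s_i^j, Eq. (2): valuation minus payment
    if it wins and the payment does not exceed its budget, else 0."""
    if is_winner and payment <= budget:
        return valuation - payment
    return 0.0

def relay_utility(collected, paid, is_winner):
    """Utility of a relay node R_i, Eq. (3): the amount F_i(k) collected
    from its group minus the payment P_i(k) made to the primary user."""
    return collected - paid if is_winner else 0.0

def pu_utility(received, reserve, is_winner):
    """Utility of a primary user PU_k, Eq. (4): the payment received from
    the winning relay node minus its reserve price A_k."""
    return received - reserve if is_winner else 0.0
```

For instance, a winning SU with valuation 7, payment 5, and budget 6 gains utility 2, while the same SU with budget 4 could not afford the payment and gains 0.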

2.3 Economic Properties

In this section, we present the economic properties that we would like to achieve. An auction should not be executed unless these properties are satisfied.

Definition 1 (Truthfulness). An auction is truthful if bidding its true valuation is a dominant strategy for every participant, i.e., each bidder's utility is maximized by reporting its true valuation, and no bidder can improve its utility by misreporting. In our mechanism, each s_i^j submits its true valuation to R_i, and each relay node R_i reports its true valuation to the kth primary user PU_k.

Definition 2 (Budget Balance). An auction is budget balanced if the total payment collected from the buyers is no less than the total revenue paid to the sellers. In our mechanism, the tier I auction is conducted in groups, and we ensure that the utilities of the auctioneers are nonnegative: the payments that the relay nodes receive from their groups are no less than the amounts they pay to the PUs.

Definition 3 (Individual Rationality). An auction is individually rational if the utility of each participant is nonnegative. In the TAMSA scheme, the utilities of the SUs, the relay nodes R_i, and the PUs are nonnegative; that is, u_i^j, U_R_i, and U_PU_k are nonnegative.

Definition 4 (Computational Efficiency). An auction is computationally efficient if the mechanism terminates in polynomial time. In our mechanism, the selection of winning SUs, the matching of PUs and relay nodes, and the computation of the clearing price and payments can all be completed in polynomial time.

3 Two-Stage Auction Mechanism

In this section, we propose a truthful two-stage auction framework called TAMSA for the cognitive radio network shown in Fig. 1. TAMSA consists of two sub-auctions and satisfies truthfulness, budget balance, individual rationality, and computational efficiency.

3.1 Stage I Auction

In this stage, the n_i secondary users are randomly divided into multiple groups. The groups submit their bids or budgets to the relay nodes separately. The relay nodes conduct the auction and tentatively decide the winning group members, then calculate the payment of each winner and determine the final winners. A relay node allocates channels to its SUs if it obtains spectrum resources in the tier II auction. We first introduce the algorithm to buy the spectrum by group and decide the winners (GBDW); the details are as follows. First, relay node R_i collects the bid vector (b_i^1, b_i^2, · · · , b_i^{n_i}), the demands (d_i^1, d_i^2, · · · , d_i^{n_i}), and the valuations (v_i^1, v_i^2, · · · , v_i^{n_i}) from the SUs in S_i, as previously mentioned, and we compute the budget vector F_i(k) for PU_k. Then, the relay nodes decide the winners by the best performance ratio and calculate the optimal unit price for each group. The relay node R_i sells at most a 1/2 time fraction to S_i to maximize the revenue. Inspired by the work in [16], we sort the vector b/d in descending order; the optimal unit price for group S_i, denoted OPT(b/d), is then

  OPT(b/d) = max_{1 ≤ i ≤ |b|} i · (b_i / d_i),   (5)

where |b| denotes the length of the array, and b_i and d_i denote the ith budget and demand in the sorted order. The details of the algorithm are shown in Algorithm 1. It should be noted that the clearing price is extracted from the group to ensure truthfulness. The relay node selects the maximum integer m by OPT(b/d), and then eliminates the SUs with the smallest budgets and lowest valuations. F_i(k) is the bid gathered from the winning SUs, and PU_k charges R_i(k) less than F_i(k) for trading the kth channel. The following example shows how Algorithm 1 calculates the clearing price and determines the winners. Assume there are 5 SUs in group i with budget and demand vectors b = {2, 3, 7, 6, 8} and d = {1, 2, 3, 2.5, 4}, so b/d = {2, 1.5, 2.33, 2.4, 2}. We sort b/d in descending order and calculate OPT(b/d) to get the maximum m; hence m = 4 and the clearing price is p_c = 8/4 = 2. The payment s_i^1 pays to the ith relay node is p_i^1 = p_c × d_i^1 = 2 × 1 = 2. In the same way, the payments of the other 4 secondary users are 4, 6, 5, and 8, respectively. Therefore, the winners in the ith group are s_i^1, s_i^3, s_i^4, and s_i^5, and the amount collected by the ith relay node is 21.
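As a cross-check, the worked example above can be reproduced with a short Python sketch of the group-buying step. The function name is ours; we use `payment ≤ budget` as the winning condition, which matches the winners listed above (Algorithm 1 as printed uses strict inequalities and additionally checks the valuation):

```python
def group_buying_clearing(bids, demands):
    """Sketch of the group-buying step: sort the bid/demand ratios in
    descending order, pick m maximizing OPT(b/d) = m * ratio_m (Eq. (5)),
    and use the mth largest ratio as the clearing price p_c."""
    ratios = sorted((b / d for b, d in zip(bids, demands)), reverse=True)
    m = max(range(1, len(ratios) + 1), key=lambda i: i * ratios[i - 1])
    pc = ratios[m - 1]                       # clearing price p_c
    payments = [pc * d for d in demands]     # Eq. (1): p = p_c * d
    winners = [j for j, (p, b) in enumerate(zip(payments, bids)) if p <= b]
    revenue = sum(payments[j] for j in winners)  # F_i(k) gathered by the relay
    return pc, winners, revenue

pc, winners, revenue = group_buying_clearing([2, 3, 7, 6, 8], [1, 2, 3, 2.5, 4])
# pc = 2.0, winners = [0, 2, 3, 4] (i.e., s1, s3, s4, s5), revenue = 21.0
```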


Algorithm 1. GBDW: Group Buying and Decide Winners
Input: Sorted vector of b/d and the valuations.
Output: The revenue of the relay node, S_i^w, and the payments of the secondary users.
 1: Let 1 ≤ m ≤ n_i − 1 be a bid-independent integer.
 2: Search for the m in b/d that attains the maximum OPT(b/d).
 3: p_c ← b_i^m / m
 4: S_i^w(k) ← ∅
 5: F_i(k) ← 0
 6: for j ← 1 to n_i do
 7:   p_i^j(k) ← p_c · d_i^j(k)
 8:   if p_i^j(k) < b_i^j(k) and p_i^j(k) < v_i^j(k) then
 9:     S_i^w(k) ← S_i^w(k) ∪ {s_i^j(k)}
10:     F_i(k) ← F_i(k) + p_i^j(k)
11:   end if
12: end for
13: return F_i(k), S_i^w(k)

3.2 Stage II Auction

In this procedure, the auction is conducted between the PUs and the relay nodes, and the relay nodes compete for the idle spectrum resources of the PUs. According to previous research, the McAfee auction mechanism cannot be utilized, since it only suits scenarios with homogeneous goods to trade [17]. To ensure the truthfulness of the auction mechanism and its applicability to heterogeneous networks, we design a spectrum resource allocation algorithm (SRA) based on the VCG auction mechanism; the details of SRA are shown in Algorithm 2. We apply the VCG-based auction mechanism to maximize the social welfare, that is, the total utility of all the participating bidders. A relay node assigns spectrum resources to S_i^w when it wins a primary user, and it pays the winning PU_k the reward P_i, which is calculated by algorithm SRA. We use the bids of the relay nodes B_i(k) and the reserve prices A_k to construct a weighted complete bipartite graph with edge weights (B_i(k) − A_k). Maximum-Weighted-Matching (MWM) maximizes the total utility of all participators in this auction, and to ensure truthfulness, we apply a VCG-based rule to calculate the payments of the relay nodes. The details are as follows.

4 Theoretical Analysis

In this section, we prove that TAMSA satisfies truthfulness, individual rationality, budget balance, and computational efficiency.

Theorem 1. TAMSA is truthful in the network.

Proof. In the following, we focus on proving the dominant strategy for the SUs. A buyer s_i^j(k) ∈ S_i will submit its true bid and demand, because they reflect its true demand for spectrum resources.


Algorithm 2. SRA: Spectrum Resource Allocation
Input: B_i(k), A_k, for all 1 ≤ i ≤ n_i and 1 ≤ k ≤ M.
Output: R^w, PU^w, P_i.
 1: W ← ∅, E* ← ∅, P_i ← ∅  // W is the edge set in the matching graph.
 2: Create a weighted complete bipartite graph G = (R, PU, W, w) with weight w(R_i, PU_k) = B_i(k) − A_k if B_i(k) ≥ A_k.
 3: E* ← Maximum-Weighted-Matching(W)
 4: for each (R_i, PU_k) ∈ E* do
 5:   R^w ← R^w ∪ {R_i}, PU^w ← PU^w ∪ {PU_k}
 6:   W′ ← W \ (R_i, PU_k), R′ ← R \ {R_i}
 7:   G−i ← (R′, PU, W′, w)
 8:   E*−i ← Maximum-Weighted-Matching(W′)
 9:   P_i ← w(E*−i) − (w(E*) − w(R_i, PU_k)) + A_k
10: end for
11: return R^w, PU^w, P_i
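To illustrate the matching and the VCG price of line 9 on a toy instance, here is a brute-force Python sketch. Exhaustive enumeration stands in for a real maximum-weighted-matching routine and is only viable for small M; all names and numbers are illustrative, not from the paper:

```python
from itertools import permutations

def sra_sketch(B, A):
    """Brute-force rendering of Algorithm 2 (SRA): find the maximum-weight
    matching between relays and PUs with edge weight B_i(k) - A_k (edges
    exist only when B_i(k) >= A_k), then charge each winning relay the VCG
    price P_i = w(E*_-i) - (w(E*) - w(R_i, PU_k)) + A_k."""
    n_r, n_p = len(B), len(A)

    def best_matching(skip=None):
        # maximum-weight matching over all assignments, optionally
        # excluding one relay (used for the VCG "without i" matching)
        relays = [i for i in range(n_r) if i != skip]
        best_w, best_m = 0, []
        for perm in permutations(range(n_p), len(relays)):
            m = [(i, k) for i, k in zip(relays, perm) if B[i][k] >= A[k]]
            w = sum(B[i][k] - A[k] for i, k in m)
            if w > best_w:
                best_w, best_m = w, m
        return best_w, best_m

    w_star, e_star = best_matching()
    payments = {}
    for i, k in e_star:
        w_minus_i, _ = best_matching(skip=i)
        payments[i] = w_minus_i - (w_star - (B[i][k] - A[k])) + A[k]
    return e_star, payments

# two relays bidding on two channels with reserve prices A = [3, 4]
matches, pay = sra_sketch(B=[[6, 8], [7, 5]], A=[3, 4])
# matches relay 0 to PU 1 and relay 1 to PU 0; VCG prices pay[0] = 4, pay[1] = 3
```

In this toy run, each winning relay pays no more than its bid and no less than the reserve price, in line with the individual rationality claimed in Theorem 2.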

For s_i^j ∈ S_i, it cannot improve its utility by changing its valuation and budget, by the first branch of Eq. (2). Besides, inspired by [18], the clearing price p_c is generated from the optimal price ratio independently of any single bid. For R_i ∈ R^w, it obtains the maximum utility max(B_i(k) − A_k) if it gets the spectrum resource in this auction. If R_i ∉ R^w, it fails in this auction and cannot get the spectrum resource, because (B_i(k) − A_k) < 0. If relay node R_i submits an untruthful bid, the result does not change when B_i(k) < F_i(k); when B_i(k) > F_i(k), the utility of the relay node under an untruthful bid is U_R_i(k) = F_i(k) − P_i = F_i(k) − B_i(k) ≤ 0. Therefore, neither the relay nodes nor the SUs can improve their utility by submitting untruthful bids.

Theorem 2. TAMSA is individually rational and budget balanced.

Proof. For the SUs, the utility of s_i^j(k) is v_i^j(k) − p_i^j(k) > 0 for all s_i^j ∈ S_i^w, which proves the individual rationality of the SUs. We then prove that the relay nodes are also individually rational: for relay node R_i, the minimum payment is A_k for all R_i ∈ R_i^w, with B_i(k) ≤ F_i(k) and PU_k ∈ PU^w. Besides, the utility of the primary user satisfies U_PU_k = F_i(k) − P_i ≥ B_i(k) − A_k > 0. Therefore, both buyers and sellers are willing to participate in the auction; they all gain nonnegative utility, and the TAMSA mechanism is individually rational and budget balanced.

Theorem 3. TAMSA is computationally efficient.

Proof. We now analyze the time complexity of TAMSA. In Algorithm 1, the time complexity of the sorting process is O(n_i log n_i). In Algorithm 2, the maximum-weighted-matching algorithm takes O(max{n_i, M}^3) time, and computing the payments takes O(n_i max{n_i, M}^3). Hence, TAMSA is computationally efficient.


5 Numerical Results

In this section, we evaluate the performance of TAMSA. In the heterogeneous network structure we designed, this is the first incentive scheme proposed for the specific demands of secondary users, and there are no existing auction schemes to compare with directly. Instead, we design an upper bound (Upper) and a random algorithm (Random) for TAMSA to compare with; we also simulate the algorithms TASG and TACC. The algorithm Upper uses the bids of the buyers as the payments to maximize the revenue. In TASG and TACC, the secondary users are randomly divided into two sets and the winning set is priced using the other set. TASG is based on the VCG mechanism, while TACC sorts the reserve prices A_k of the primary users in ascending order and the budgets B_i(k) of the relay nodes in descending order. The experiment tool is MATLAB, and the results are averaged over 100 repetitions. We consider the heterogeneous network shown in Fig. 1. We assume that the number of PUs is M = 5, that 5 relay nodes participate in this auction, and that the number of SUs n_i varies from 20 to 120 with an increment of 20. We assume that the valuations v_i^j(k) and budgets b_i^j(k) of the secondary users are uniformly distributed, with ranges denoted U(50, 150) and U(5, 10), respectively. The reserve price A_k follows U(10, 20), as in [15–18].

5.1 Simulation Results

We first investigate the running time of TAMSA; the results are shown in Figs. 2 and 3. From Fig. 2, we can see that the running time is no more than 0.35 s even when the number of SUs becomes large, i.e., when there are 120 SUs. From Fig. 3, we can see that the algorithm Random runs fastest, since Random selects the winning secondary users S_i^w randomly.

Fig. 2. Running time of TAMSA (a). [Surface plot of running time (s) over the number of PUs and the number of SUs.]

Fig. 3. Running time of TAMSA (b). [Running time (s) versus the number of SUs, for Random, TAMSA, TASG, and TACC.]


For the TACC auction mechanism, the reserve prices A_k of the primary users are sorted in ascending order and the budgets B_i(k) of the relay nodes in descending order to guarantee the utility of the PUs. Besides, TACC needs to match every primary user and relay node, so it runs the slowest. The running times of TAMSA and TASG are not large, because they use the maximum-weighted-matching algorithm to complete the matching between the winning PUs and the relay nodes.

Next, to validate Theorem 2 regarding the individual rationality and budget balance of TAMSA, we show the truthfulness in Fig. 4. In this auction, the payment P_i(k) of a relay node is not higher than the amount F_i(k) collected from the SUs, and each winning primary user PU_i^w receives a payment not less than its reserve price A_k from the auctioneer. From the experimental results in Fig. 4, we can see that the utility remains nonnegative when relay nodes submit truthful bids; but when a relay node submits an untruthful bid, its utility drops rapidly and remains negative. Figure 4 depicts this difference in utility: relay nodes submit truthful bids when the bid of the relay node is less than 50, and when the bid is greater than 50, the gap between truthful and untruthful bids appears. The utilities of the relay nodes and PUs are nonnegative, because under truthful bidding the bid of the relay node is less than the amount collected from the SUs and greater than the reserve price of the PUs, i.e., B_i(k) ≤ F_i(k) and B_i(k) ≥ A_k. The utility of a relay node is negative, that is, B_i(k) > F_i(k), if it submits an untruthful bid. In summary, as seen in Fig. 4, the utility of the relay nodes cannot be improved by submitting untruthful bids.

Fig. 4. Truthfulness of TAMSA. [Utility of relay nodes versus their bids, for truthful and untruthful bidding.]

Fig. 5. Average utility of PUs with the number of SUs. [For Upper, Random, TAMSA, TACC, and TASG.]

Figure 5 shows how the utility of the primary users U_PU_k varies with the number of SUs. As the number of SUs increases, the average utility of the PUs achieved by the five algorithms gradually increases. On average, the proposed TAMSA improves the utility of the PUs by 256% compared with Random; TASG is about 217% better than Random, and TACC achieves about 156% utility gain over Random on U_PU_k. TAMSA is further improved by up to 28.33% and 78.65% over TASG and TACC in terms of average utility of the PUs, respectively. This is because both TAMSA and TASG apply the maximum weighted matching algorithm to match the PUs and relay nodes, ensuring maximum benefit; besides, both use a VCG-based auction mechanism to ensure truthfulness. The difference between TASG and TAMSA is that TAMSA selects the winning set of SUs with the optimal cost performance, while TASG selects the winning set using the bids of one subset of SUs to price another. The optimal cost performance enhances the revenue of the PUs. In TACC, although the utility of the PUs can be increased, the maximization of their earnings cannot be guaranteed.

Fig. 6. Average utility of relay nodes with the number of SUs. [For Upper, Random, TAMSA, TACC, and TASG.]

Fig. 7. Average utility of SUs with the number of SUs. [For Upper, Random, TAMSA, TACC, and TASG.]

Figure 6 depicts the average utility of the relay nodes with a varying number of SUs. We can see that TAMSA outperforms Random by about 10 times on average; TASG and TACC are about 7 times and 6.6 times better than Random, respectively. TAMSA is further improved by up to 44.59% and 64.22% over TASG and TACC in terms of average utility of relay nodes, respectively. The reason is that both TAMSA and TASG use the VCG auction mechanism to calculate the payment P_i(k) of the relay nodes. In Algorithm 2, the payment of a relay node is effectively reduced on the premise of guaranteeing the primary user's revenue, so the utility of the relay nodes is improved.

Figure 7 shows the relationship between the average utility of the SUs and the number of SUs. The average utility of the SUs in TAMSA outperforms Random by 213%, TASG improves it over Random by up to 181%, and TACC achieves about 115% utility gain over Random. TAMSA is improved by up to 16.99% and 85.73% over TASG and TACC in terms of average utility of the SUs, respectively. This is because TAMSA selects the winning set S_i^w with the optimal cost performance, and the payment of the SUs is calculated according to their specific demands, so TAMSA effectively improves the utility of the SUs. TASG and TACC calculate the payment of the SUs using the bids of one subset of SUs to price another: TASG adopts the optimal single-price auction to reduce the payment of the SUs, while in TACC the payment of the SUs is the average value of the winning SUs.

From the above experiments, we can see that TAMSA is suitable for heterogeneous networks, where the utilities of all participants can be improved at the same time. TAMSA gains higher social welfare than Random, TASG, and TACC. Hence, TAMSA can be deployed in real situations, and it can effectively improve the utilization of spectrum resources.

6 Conclusion

In this paper, we have proposed a two-stage truthful auction mechanism for spectrum allocation (TAMSA) in cognitive radio networks with multiple primary users, multiple secondary users, and relay nodes. We have designed an incentive mechanism that encourages spectrum holders to share their idle spectrum resources and encourages cooperative data transmission, improving the utilization of the spectrum. TAMSA is a two-stage auction mechanism: in the first stage, the SUs submit budgets and valuations for spectrum resources to the relay nodes, which calculate the payments of the SUs and determine the winning set S_i^w; in the second stage, the relay nodes submit bids to the PUs to compete for spectrum resources. We have proved that TAMSA satisfies truthfulness, individual rationality, and computational efficiency. Extensive simulation results show that TAMSA outperforms the Random algorithm by 256% in terms of average utility of the PUs, and improves the average utility of the SUs and relay nodes significantly, by up to 213% and 10 times, respectively. The performance of TAMSA is further improved by 28.33% and 78.65% in terms of average utility of the PUs over TASG and TACC, respectively. The numerical results validate our theoretical analysis and demonstrate the improved efficiency of the auction mechanism.

Acknowledgment. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702115 and 61672171, the Natural Science Foundation of Guangdong, China under Grant No. 2018B030311007, and the Major R&D Project of the Educational Commission of Guangdong under Grant No. 2016KZDXM052. This work was also supported by the China Postdoctoral Science Foundation under Grant No. 2017M622632.

References

1. Zheng, Z., Wu, F., Tang, S., et al.: AEGIS: an unknown combinatorial auction mechanism framework for heterogeneous spectrum redistribution in noncooperative wireless networks. IEEE/ACM Trans. Netw. 24(3), 1919–1932 (2016)
2. Zhu, Y., Li, B., Li, Z., et al.: Truthful spectrum auction design for secondary networks. In: INFOCOM, pp. 873–881. IEEE, Orlando, FL, USA (2012)


3. Chen, L., Huang, L., Xu, H., et al.: Optimal channel allocation for multi-PU and multi-SU pairs in underlay cognitive radio networks. Int. J. Ad Hoc Ubiquitous Comput. 27(1), 19–33 (2018)
4. Wang, X., Huang, L., Xu, H., et al.: Truthful auction for resource allocation in cooperative cognitive radio networks. In: 24th International Conference on Computer Communication and Networks, pp. 1–8. IEEE, Las Vegas, NV, USA (2015)
5. Wang, X., Huang, L., Xu, H., et al.: Social welfare maximization auction for secondary spectrum markets: a long-term perspective. In: 13th IEEE International Conference on Sensing, Communication, and Networking, pp. 1–9. IEEE, London, UK (2016)
6. Shen, F., Li, D., Lin, P.H., et al.: Auction based spectrum sharing for hybrid access in macro-femtocell networks under QoS requirements. In: IEEE International Conference on Communications, pp. 3335–3340. IEEE, London, UK (2015)
7. Wang, H., Liu, Z., Cheng, Z., et al.: Maximization of link capacity by joint power and spectrum allocation for smart satellite transponder. In: 23rd Asia-Pacific Conference on Communications, pp. 1–6. IEEE, Perth, WA, Australia (2017)
8. Jia, J., Zhang, Q., Zhang, Q., et al.: Revenue generation for truthful spectrum auction in dynamic spectrum access. In: 10th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 3–12. ACM, New Orleans, LA, USA (2009)
9. Liu, Y., Tao, M., Huang, J.: An auction approach to distributed power allocation for multiuser cooperative networks. IEEE Trans. Wirel. Commun. 12(1), 237–247 (2012)
10. Shi, W., Zhang, L., Wu, C., et al.: An online auction framework for dynamic resource provisioning in cloud computing. IEEE/ACM Trans. Netw. 24(4), 2060–2073 (2016)
11. Feng, Z., Zhu, Y., Zhang, Q., et al.: TRAC: truthful auction for location-aware collaborative sensing in mobile crowdsourcing. In: INFOCOM, pp. 1231–1239. IEEE, Toronto, ON, Canada (2014)
12. Wu, F., Vaidya, N.: A strategy-proof radio spectrum auction mechanism in noncooperative wireless networks. IEEE Trans. Mob. Comput. 12(5), 885–894 (2013)
13. Lee, C., Wang, P., Niyato, D.: A real-time group auction system for efficient allocation of cloud internet applications. IEEE Trans. Serv. Comput. 8(2), 251–268 (2015)
14. Lin, P., et al.: Groupon in the air: a three-stage auction framework for spectrum group-buying. In: INFOCOM, pp. 2013–2021. IEEE, Turin, Italy (2013)
15. Advaita, A., Gali, M.M., Chu, T.M.C., et al.: Outage probability of MIMO cognitive cooperative radio networks with multiple AF relays using orthogonal space-time block codes. In: Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 84–89. IEEE, Rome, Italy (2017)
16. Yang, D., Xue, G., Zhang, X.: Group buying spectrum auctions in cognitive radio networks. IEEE Trans. Veh. Technol. 66(1), 810–817 (2017)
17. Yang, D., Fang, X., Xue, G.: Truthful auction for cooperative communications. In: IEEE International Conference on Communications, pp. 1–10. IEEE, Ottawa, ON, Canada (2011)
18. Chen, L., Wu, J., Zhang, X.X., et al.: TARCO: two-stage auction for D2D relay aided computation resource allocation in HetNet. IEEE Trans. Serv. Comput. PP(99), 1 (2017)

QoS-Driven Service Matching Algorithm Based on User Requirements

Mengying Guo(B) and Xudong Yang

School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
{mengying 1204,xdyang}@bupt.edu.cn

Abstract. Quality of Service (QoS) is an important factor which should be considered in service matching. There are two problems in most existing solutions. First, most QoS models are static models described by determinate values or probability distributions, ignoring the impact of the time factor; however, most QoS attributes, such as response time and reliability, are time-dependent. Second, the service selection criteria of most QoS-driven service matching algorithms are based on service performance, while user requirements and the load of services are not considered. In this paper, we propose a Time-Segmented QoS Model (TSQM) to model QoS dynamically. Based on this model, a Service Matching algorithm based on user QoS request and Priority (QPSM) is proposed, in which the priority of user requests is used to control the load on the services. Simulation results show that the algorithm achieves a higher response rate and better load balancing.

Keywords: Service matching · QoS · Dynamic QoS model · Service model · Load balancing

1 Introduction

SOA (Service-Oriented Architecture) has made it possible for IoT (Internet of Things) systems to build distributed applications from loosely coupled services [1]. In this way, IoT services can be provided to different systems as web services. With the number of IoT services increasing rapidly, selecting among the numerous registered services has become difficult [2]. The characteristics of IoT services dictate that service function and service quality must be taken into account simultaneously when performing service matching. QoS (Quality of Service), measured by criteria such as delay, response time, reliability, availability, and cost [3], has become a crucial factor in selecting among numerous services with the same functionality. The results of service matching depend not only on the matching degree to user requirements but also on the QoS attributes of the service itself. QoS-aware service selection is a complex multi-criterion decision problem, which is NP-hard, and it remains a challenging research topic [4].

© Springer Nature Switzerland AG 2018. J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 17–27, 2018. https://doi.org/10.1007/978-3-030-05057-3_2


M. Guo and X. Yang

There have been many reasonable selection models and effective matching algorithms for QoS-aware service selection. In these models and algorithms, service matching is treated as an optimization problem over candidate services, and the objective is to find the best service. However, the actual requirements of users are not considered, which is unacceptable for some users: the matched service may have the best overall performance yet fail to satisfy the user's requirement for a particular QoS attribute. Another problem is that the QoS attributes are represented only by single-valued or probabilistic models, and the influence of time is not taken into account. Because service QoS attributes change dynamically with time and user load, a static model cannot represent the QoS values accurately, which seriously affects the accuracy of the matching results. In this paper, by splitting time and modeling each time period dynamically, we propose a Time-Segmented QoS Model (TSQM) which can represent QoS attributes more accurately. Based on our model, a Service Matching algorithm based on user QoS request and Priority (QPSM algorithm) is proposed. In this algorithm, the single QoS performance and the comprehensive QoS performance provided by services are considered simultaneously, and the load on each service is controlled according to priority, so that user load is balanced across the services.

The rest of the paper is organized as follows. Section 2 introduces the related work on service matching technology. Section 3 details the TSQM model and the QPSM algorithm. Section 4 presents simulation results that demonstrate the feasibility and effectiveness of the QPSM algorithm. Section 5 concludes this paper.

2 Related Work

QoS-based service matching can usually be divided into two relatively independent processes: service selection and service ranking [5]. Service selection ensures the most basic functional and QoS requirements of users or systems; service ranking is a further optimization on this basis. Models and algorithms for service selection can be divided into function-based and quality-based selection according to the selection criterion. In function-based models, concepts such as semantics or ontology are used to build service models [6,7]. Quality-based selection can be divided into single-QoS-performance and comprehensive-QoS-performance selection models [5], and also into single-value and probability-based selection models [8–10].

Service function is one of the requirements that must be satisfied in service matching. The fundamental purpose of service matching is to select the most appropriate service for the user based on the user's service request. More and more models describe and define services with semantic web and ontology techniques in order to understand the functional requirements of users more intelligently. A new resource model describing IoT resources in multiple dimensions was proposed in [6]. Based on this model, a resource matching algorithm that selects suitable resources according to semantic similarity was also proposed.

QoS-Driven Service Matching Algorithm Based on User Requirements

In [7] the authors proposed a QoS-based dynamic service composition method for the semantic IoT. Using a context-added QoS ontology, after dynamic semantic annotation of the services in the semantic internet of things, candidate service sets are dynamically selected and combined to provide more accurate services. Service quality is another requirement that must be satisfied in service matching. The QoS attributes of a service significantly impact its comprehensive evaluation; therefore, QoS-based service selection is a viable scheme for IoT service selection. In most studies, such as [8,9], single-valued or probabilistic models are used to model each QoS dimension, and the optimal services are selected by comparing service performance. In QoS-aware service matching, not only the overall performance of the service but also each user QoS requirement should be considered. In [10] the authors proposed a service discovery algorithm based on multi-stage service matching: each QoS attribute is assigned a different weight, the QoS constraints are determined according to the user request, and finally the most suitable service is selected. The QoS of a web service changes dynamically with factors such as network condition, user load, and time. A static model constructed solely from historical data cannot accurately represent these dynamic changes; therefore, the time factor must be considered when modeling.

3 Service Model

In a complete service matching process, both the function and the quality of a service should be taken into consideration. Assume that the virtual service set S is known and that all services in S satisfy the functional requirements requested by the user. Next, QoS modeling and service matching are discussed further.

3.1 Time-Segmented QoS Model Definition

The TSQM model is a time-segmented QoS model. According to the changes of QoS attributes over time, the QoS change period is divided into time periods with different intervals, and a QoS model is constructed separately for each period.

Definition. The TSQM model for a service can be represented as a triple (ET, P, QM), where

• ET = [T0, T0 + T) is the effective period of QoS, T0 is the start time of the effective period, and T is the update period of the QoS attributes.
• P = {P1, P2, · · · , PN} is the set of time periods of ET, with Pi = [ti, ti+1) and ∪i Pi = ET.


• QM = (Q1, Q2, · · · , QN) is a sequence of QoS models, where Qi = (fDELAYi, fRESTi, fRELi, fUSAi, fCOSTi) is the QoS vector of time period Pi, and fDELAYi, fRESTi, fRELi, fUSAi, fCOSTi are the probability distribution functions of delay, response time, reliability, availability, and cost.

Given a service, its QoS model at time t can be represented as Q(t) = (fDELAYt, fRESTt, fRELt, fUSAt, fCOSTt), where t ∈ [ti + kT, ti+1 + kT), k = 0, 1, · · ·

The TSQM model shows that the QoS of a service changes with time. The model can be flexibly extended according to different user requirements, and the number of QoS attributes in each time period can be one or more. In this paper, delay, response time, reliability, availability, and cost are selected as the QoS attributes.

3.2 Detailed Description of the Model

QoS Model. A QoS model of a service contains k QoS attributes. These can be the 5 non-functional attributes defined in the TSQM model, and they can also be extended according to user requirements. The QoS of service Si corresponds to a set of QoS vectors consisting of a probability distribution function for each time period. In order to compare QoS performance more easily, the probability distribution function in each time period is converted into a determined value using the 999 criterion (choose the value that 99.9% of the data satisfies as the QoS value of the current time period), i.e., fQoSi → qi. For clarity, the QoS attributes mentioned below default to the QoS attributes within a certain time period. The QoS attributes of service Si can be represented as a vector Qi = (qi1, qi2, · · · , qik), where qik is the value converted from the probability distribution function of the k-th QoS attribute. We assume that the virtual service set consists of n candidate services, S = {S1, S2, · · · , Sn}, and their QoS attributes can be represented as an n × k matrix:

$$M = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1k} \\ q_{21} & q_{22} & \cdots & q_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \cdots & q_{nk} \end{bmatrix} \qquad (1)$$

Because of the differences in the ranges of QoS values and in their effects on comprehensive service performance, the QoS values should be normalized by min-max normalization [11]. According to their impact on the comprehensive performance of the service, QoS attributes can be classified into positive-effect and negative-effect attributes. The larger the value of a positive-effect attribute (such as reliability, availability, or reputation) or the smaller the value of a negative-effect attribute (such as cost or response time), the better the overall performance of the service. Assuming that the


range of qi is [min(qi), max(qi)], positive-effect and negative-effect attributes should be normalized by formulas (2) and (3), respectively:

$$q_i' = \begin{cases} \dfrac{q_i - \min(q_i)}{\max(q_i) - \min(q_i)}, & \max(q_i) - \min(q_i) \neq 0 \\ 1, & \max(q_i) - \min(q_i) = 0 \end{cases} \qquad (2)$$

$$q_i' = \begin{cases} \dfrac{\max(q_i) - q_i}{\max(q_i) - \min(q_i)}, & \max(q_i) - \min(q_i) \neq 0 \\ 1, & \max(q_i) - \min(q_i) = 0 \end{cases} \qquad (3)$$

All QoS values lie in [0, 1] after normalization, and the comprehensive performance of the service grows with each QoS value; that is, the larger the QoS value, the better the service performance.

Service Request. A service request sent by the user to the service platform during service discovery can be represented as Req = {Qreq, Mreq}, where Qreq = (α1, α2, · · · , αk) is a QoS request vector and α1, α2, · · · , αk are the user's expected values for the k attributes qi1, qi2, · · · , qik. The values α1, α2, · · · , αk should be normalized by formula (2) or (3), yielding α1', α2', · · · , αk', so that Qreq is converted to Q'req. The priority vector is Mreq = (m1, m2, · · · , mj), j ∈ {1, 2, · · · , k}, where j identifies the j-th attribute in Qreq as a priority attribute of the request Req. Mreq, which may include one or more priority attributes, is defined by the user requirements and fully reflects the user's preference among the QoS attributes of the target service. The user requirement emphasizes the importance of the j-th attribute qj in the target service, and qj is expected to satisfy the requirement αj in Qreq as far as possible, i.e., qj ≥ αj.
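As an illustration of formulas (2) and (3), the following sketch normalizes an n × k QoS matrix column by column. The function name `normalize_qos` and the sample values are ours, not from the paper:

```python
import numpy as np

def normalize_qos(M, positive):
    """Min-max normalize an n-by-k QoS matrix column-wise.

    M        : n x k array-like, one row per service, one column per attribute.
    positive : length-k booleans; True for positive-effect attributes
               (e.g. reliability), False for negative-effect ones (e.g. cost).
    After normalization every value lies in [0, 1] and larger always means
    better, as required by formulas (2) and (3).
    """
    M = np.asarray(M, dtype=float)
    out = np.empty_like(M)
    for j in range(M.shape[1]):
        lo, hi = M[:, j].min(), M[:, j].max()
        if hi == lo:                                  # degenerate column -> 1 by definition
            out[:, j] = 1.0
        elif positive[j]:
            out[:, j] = (M[:, j] - lo) / (hi - lo)    # formula (2)
        else:
            out[:, j] = (hi - M[:, j]) / (hi - lo)    # formula (3)
    return out

# two services, attributes = (reliability, cost)
M = [[0.9, 10.0],
     [0.6, 40.0]]
print(normalize_qos(M, positive=[True, False]))
```

Note that a column with identical values maps to 1 for every service, matching the second branch of both formulas.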

Priority. The priority of a service request depends on αj in the QoS request vector Qreq. Suppose h is the user's expected value of a certain QoS attribute, i.e., h = αj. The priority of the request is then calculated by formula (4):

$$Prior(h) = \begin{cases} 1, & h \in [0, T_1) \\ 2, & h \in [T_1, T_2] \\ 3, & h \in (T_2, 1] \end{cases} \qquad (4)$$

T1 and T2 are single-performance thresholds used to determine the priority of the service request; their values lie in [0, 1], with T1 ≤ T2. The priority of the service request Req is divided into three levels, 1, 2, and 3, representing low, medium, and high priority, respectively. A matching strategy is selected according to the request priority. The matching strategy set can be represented as MS = {MSH, MSM, MSL}, where MSH, MSM, and MSL denote the matching strategies of the different priorities.
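Formula (4) is a simple interval test; a minimal sketch follows (the function name `prior` is ours):

```python
def prior(h, t1, t2):
    """Priority of a service request per formula (4).

    h      : the user's expected (normalized) value of the priority attribute.
    t1, t2 : single-performance thresholds, 0 <= t1 <= t2 <= 1.
    Returns 1 (low), 2 (medium) or 3 (high).
    """
    if h < t1:          # h in [0, T1)
        return 1
    elif h <= t2:       # h in [T1, T2]
        return 2
    else:               # h in (T2, 1]
        return 3

# with T1 = 0.5 and T2 = 0.8, an expectation of 0.9 is a high-priority request
assert prior(0.9, 0.5, 0.8) == 3
```

Because the middle interval is closed on both ends, an expectation exactly equal to T1 or T2 yields medium priority.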


QoS Performance Evaluation Value. The QoS performance evaluation value is divided into the request performance evaluation value QoSreq and the service performance evaluation value QoSser. QoSreq is computed from the user's expected QoS values and can be represented as

$$QoS_{req} = \left\|Q'_{req}\right\|^2 = \alpha_1'^2 + \alpha_2'^2 + \cdots + \alpha_k'^2 = \sum_{i=1}^{k} \alpha_i'^2,$$

where Q'req = (α1', α2', · · · , αk') is the QoS request vector after normalization. The QoSser of service Si can be represented as

$$QoS_{ser}(i) = \left\|Q'_i\right\|^2 = q_{i1}'^2 + q_{i2}'^2 + \cdots + q_{ik}'^2 = \sum_{j=1}^{k} q_{ij}'^2,$$

where Q'i = (qi1', qi2', · · · , qik') is the QoS attribute vector after normalization.

The Utility of Service Matching. U(i) is the utility of the service matching algorithm when service Si is selected as the target service satisfying the request Req. It is divided into the single-performance utility value US(i) and the comprehensive-service utility value UC(i). US(i) is the ratio between a certain QoS attribute of Req and that of Si, given by formula (5). UC(i) is the ratio between the overall performance evaluation values of Req and Si, given by formula (6). U(i) is the weighted sum of US(i) and UC(i), given by formula (7).

$$U_S(i) = \begin{cases} h / q_{ij}, & h < q_{ij} \\ q_{ij} / h, & h \geq q_{ij} \end{cases} \qquad (5)$$

$$U_C(i) = \begin{cases} QoS_{req} / QoS_{ser}(i), & QoS_{req} < QoS_{ser}(i) \\ QoS_{ser}(i) / QoS_{req}, & QoS_{req} \geq QoS_{ser}(i) \end{cases} \qquad (6)$$

$$U(i) = \mu \times U_S(i) + (1 - \mu) \times U_C(i) \qquad (7)$$

Here μ is a weighting factor in the range [0, 1]; the relative impact of US(i) and UC(i) on U(i) can be adjusted through μ. In the matching process, the greater the utility, the better the service matches the user requirements.
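Formulas (5)–(7) can be put together in a short sketch. The function names `utility` and `qos_eval` and the sample vectors are ours:

```python
def qos_eval(vec):
    """Performance evaluation value: squared Euclidean norm of a
    normalized QoS vector (request or service)."""
    return sum(v * v for v in vec)

def utility(h, q_ij, qos_req, qos_ser, mu):
    """Matching utility U(i) per formulas (5)-(7).

    h       : user's expected value of the priority attribute (normalized).
    q_ij    : candidate service's value of that attribute (normalized).
    qos_req : request performance evaluation value  ||Q'_req||^2.
    qos_ser : service performance evaluation value  ||Q'_i||^2.
    mu      : weighting factor in [0, 1].
    """
    # formula (5): single-performance utility, a min/max ratio in (0, 1]
    us = h / q_ij if h < q_ij else q_ij / h
    # formula (6): comprehensive utility, also a min/max ratio
    uc = qos_req / qos_ser if qos_req < qos_ser else qos_ser / qos_req
    # formula (7): weighted combination
    return mu * us + (1 - mu) * uc

req = [0.8, 0.6]   # normalized request vector Q'_req
srv = [0.9, 0.7]   # normalized service vector Q'_i
u = utility(0.8, 0.9, qos_eval(req), qos_eval(srv), mu=0.5)
print(round(u, 4))
```

Both ratios are taken smaller-over-larger, so US(i) and UC(i) each peak at 1 when request and service agree exactly, and μ trades one off against the other.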

4 Service Matching Algorithm

QoS-based service matching algorithms can be roughly divided into two kinds: single-QoS-performance matching and overall-QoS-performance matching. In the QPSM algorithm, service selection and matching are performed according to user-defined priority attributes and QoS, so the service best suited to the user requirements can be matched. The QPSM algorithm is given as Algorithm 1. Its main idea is to select the matching strategy corresponding to the priority of the user request and then select the service most suitable for the user. The priority of the user request is determined by the specified priority attributes, and different matching strategies are adopted according to that priority. When the request priority is determined to be high, the target service must satisfy


Algorithm 1. QoS-based service matching algorithm (QPSM)
Input:  (1) S   // service set
        (2) Req // user requirements
Output: Ser_match // all services that suit the user
 1  Initialize Req, S and its corresponding QoS attribute matrix M;
 2  Determine the priority of the request;
 3  Compose the priority service set Ser_prior: q_ij ≥ h;
 4  Compose the candidate service set Ser_wait: QoS_ser(i) ≥ QoS_req;
 5  while Req is not empty do
 6      if Prior(h) = 3 then
 7          if Ser_prior = ∅ then
 8              Ser_match ← null
 9          else
10              Ser_match ← the largest QoS_ser(i) from Ser_prior
11          end
12      end
13      if Prior(h) = 1 then
14          if Ser_wait = ∅ then
15              Ser_match ← the largest QoS_ser(i) from S
16          else
17              Ser_match ← the minimum QoS_ser(i) from Ser_wait
18          end
19      end
20      if Prior(h) = 2 then
21          if Ser_prior ≠ ∅ and Ser_wait = ∅ then
22              Ser_match ← the largest QoS_ser(i) from Ser_prior
23          end
24          if Ser_prior = ∅ and Ser_wait ≠ ∅ then
25              Ser_match ← the largest q_ij from Ser_wait
26          end
27          if Ser_prior = ∅ and Ser_wait = ∅ then
28              Ser_match ← the largest U(i) from S
29          end
30          if Ser_prior ≠ ∅ and Ser_wait ≠ ∅ then
31              if Ser_inter = Ser_prior ∩ Ser_wait ≠ ∅ then
32                  Ser_match ← the largest U(i) from Ser_inter
33              else
34                  if Ser_union = Ser_prior ∪ Ser_wait ≠ ∅ then
35                      Ser_match ← the largest U(i) from Ser_union
36                  end
37              end
38          end
39      end
40  end
41  return Ser_match;

the user requirements on the priority attributes completely. When the request priority is judged to be low, the service with the smallest service performance evaluation value that still satisfies the user request performance evaluation value is


selected, so the load of the entire service system is balanced and an optimized matching of resources is achieved. When the request priority is judged to be medium, the user request and service performance are weighed against each other, and service selection is determined by the utility of service matching. Ser_match, the matching service set, is composed of the services selected by the priority attributes. When there is more than one priority attribute, a conflict between matching strategies may occur. Merging the matching services means merging the services in Ser_match so that the single most suitable service is finally selected for the user. Algorithm 2 shows the whole procedure of matching-service merging.

Algorithm 2. Merge matching service
Input:  Ser_match  // matching service set
Output: Ser_result // the most suitable service for the user
 1  Initialize α' ∈ {α1', · · · , αk'}, i ∈ {1, · · · , n}, j ∈ {1, · · · , k};
 2  for Ser_match ≠ ∅ do
 3      if num(Prior(α') = 3) ≥ 1 then
 4          if num(Ser_match(q_ij ≥ αj')) ≥ 2 then
 5              Ser_result ← the largest U(i) from Ser_match(q_ij ≥ αj')
 6          end
 7          if num(Ser_match(q_ij ≥ αj')) = 1 then
 8              Ser_result ← Ser_match(q_ij ≥ αj')
 9          end
10          if num(Ser_match(q_ij ≥ αj')) = 0 then
11              Ser_result ← null
12          end
13      end
14      if num(Prior(α') = 3) = 0 then
15          if num(Ser_match) ≥ 2 then
16              Ser_result ← the largest U(i) from Ser_match
17          else
18              Ser_result ← Ser_match
19          end
20      end
21  end
22  return Ser_result;

5 Experiment Analysis

The main purpose of the QPSM algorithm is to select the most suitable service for the user according to a user-defined QoS request. To verify the feasibility and effectiveness of the algorithm, it is compared with two other QoS-based matching algorithms, Single-QoS and Overall-QoS, in four aspects: response rate, load, average single-performance value, and overall-performance value. All experiments were conducted on a computer with a 3.2


GHz Intel Core 2 Duo CPU and 12 GB RAM. The experimental data were derived from two sources: a data set containing 1000 actual services with 5 QoS values each, and a randomly generated user request data set.

The purpose of the first experiment is to evaluate the response rate of the algorithms, i.e., the ratio of successfully matched and returned requests to total requests. In this experiment, 100 services are selected for matching and 1000 service requests are randomly generated. The response rates of the three algorithms are shown in Fig. 1. As the number of user requests increases, the response rate of each algorithm tends to stabilize. The QPSM algorithm outperforms the other algorithms with the highest response rate, about 96%, whereas the response rate of the Single-QoS algorithm [8] is the lowest, about 88%. The reason is that the Single-QoS algorithm fails to respond when no candidate service satisfies the QoS constraints, and the Overall-QoS algorithm [10] fails to respond when the overall performance is lower than the requested performance. In the QPSM algorithm, matching results are found through a comprehensive consideration of user requirements and service performance.


Fig. 1. The response rate of the algorithm with the number of user requests

The second experiment evaluates the effect of load balancing, indicated by the number of times that services with different QoS performance respond to requests. In this experiment, 5 candidate services with the same function but different QoS are selected and 1000 service requests are randomly generated. The distributions of service load under the traditional UDDI algorithm [5] and the QPSM algorithm are compared, and the load distributions of the QPSM algorithm under different single-performance thresholds T1 and T2 are also tested. Figure 2 shows that the QPSM algorithm outperforms the UDDI algorithm in terms of load balancing for the same number of service requests. The greater the difference between T1 and T2, the better the load balancing, because a greater difference means more service requests are judged to be of medium priority. The third experiment evaluates the average single-performance value and the overall-performance value of the matched services. In this experiment, 1000 services are selected for matching, and 1000 user requests with high demand for response

(Legend of Fig. 2: UDDI; QPSM with T1 = 0.5, T2 = 0.8; QPSM with T1 = 0.2, T2 = 0.8. x-axis: candidate services S1–S5; y-axis: load rate (%).)

Fig. 2. Distribution of service matching load rate

(Panels: (a) average reliability; (b) average response time; (c) overall service performance. x-axis of each panel: number of service requests.)

Fig. 3. Service single-performance and overall-performance with the number of user requests

time and reliability are randomly generated. The μ in the service-matching utility U(i) is set to μ = 0.2 and μ = 0.8, respectively. Figure 3 shows that the larger μ is, the higher the average reliability of the matched services, the shorter the response time, and the lower the overall service performance value. This is because μ determines the proportions of the single-performance utility US(i) and the comprehensive-service utility UC(i) in the matching utility U(i), and thus affects the final service selection. Users can select an appropriate μ according to their requirements.


6 Conclusion

Due to the uncertainty caused by the dynamic change of service QoS and the ambiguity of user requirements, current service matching algorithms have some limitations. In order to describe QoS attributes more accurately, we propose a time-segmented QoS model that takes time into consideration. Based on this model, a service matching algorithm based on user QoS request and priority is also proposed. The algorithm fully considers user requirements and QoS performance preferences, and selects the most suitable service according to user-defined service requests and priorities, which makes it better suited to users with specific requirements. Finally, experimental results indicate that the proposed algorithm achieves a higher response rate and better load balancing.

References

1. Benslimane, D., Dustdar, S., Sheth, A.: Services mashups: the new generation of web applications. IEEE Internet Comput. 12(5), 13–15 (2008)
2. He, Q., Yan, J., Jin, H., Yang, Y.: Quality-aware service selection for service-based systems based on iterative multi-attribute combinatorial auction. IEEE Trans. Softw. Eng. 40, 192–215 (2014)
3. Zhao, S., Wu, G., Zhang, S.: Review of QoS research in SOA. Comput. Sci. 36(4), 16–20 (2009)
4. Klein, A., Ishikawa, F., Honiden, S.: SanGA: a self-adaptive network-aware approach to service composition. IEEE Trans. Serv. Comput. 7(3), 452–464 (2014)
5. Guo, D., Ren, Y., Chen, H.: A QoS constrained web service selection and ordering model. J. Shanghai Jiaotong Univ. 41(6), 870–875 (2007)
6. Zhao, S., Zhang, Y., Yu, L., Cheng, B., Ji, Y., Chen, J.: A multidimensional resource model for dynamic resource matching in internet of things. Concurr. Comput. Pract. Exp. 27(8), 1819–1843 (2015)
7. Li, L., Liu, N., Li, G.: A QoS-based dynamic service composition method in semantic internet of things. Appl. Res. Comput. 33(3), 802–805 (2016)
8. Zeng, L., Benatallah, B., Ngu, A.H.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-aware middleware for web services composition. IEEE Trans. Softw. Eng. 30(5), 311–327 (2004)
9. Cardoso, J., Sheth, A., Miller, J., Arnold, J., Kochut, K.: Quality of service for workflows and web service processes. Web Semant. Sci. Serv. Agents World Wide Web 1(3), 281–308 (2004)
10. Jia, B., Li, W., Zhou, T.: A centralized service discovery algorithm via multi-stage semantic service matching in internet of things. In: 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), pp. 422–427 (2017). https://doi.org/10.1109/CSE-EUC.2017.82
11. Chen, L., Yang, J., Zhang, L.: Time based QoS modeling and prediction for web services. In: Kappel, G., Maamar, Z., Motahari-Nezhad, H.R. (eds.) ICSOC 2011. LNCS, vol. 7084, pp. 532–540. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25535-9_38

Research on Overload Classification Method for Bus Images Based on Image Processing and SVM

Tingting Li¹, Yongxiong Sun²(✉), Yanhua Liang¹, Yujia Zhai², and Xuan Ji²

¹ College of Software, Jilin University, Changchun 130012, China
² College of Computer Science and Technology, Jilin University, Changchun 130012, China
[email protected]

Abstract. The speed and efficiency of manually screening bus images for overloading are relatively low, which wastes a large amount of human resources. Therefore, an overload classification method for bus images based on image processing and support vector machine is proposed to identify automatically whether an image shows overloading. Based on this consideration, we have done the following work. First, the bus images were preprocessed, including image enhancement using the histogram equalization method and image segmentation using an improved Otsu algorithm. Second, features of the segmented images were extracted with the Kirsch edge detection operator to establish an image feature sample library. Finally, an appropriate kernel function and parameters were chosen to build a classifier model based on support vector machine, which is trained on the sample library to classify the bus images. Theoretical analysis and experimental results show that the average classification accuracy of the polynomial kernel function is better than those of the Gaussian and Sigmoid kernel functions within the finite range of parameters examined. When the parameter d of the polynomial kernel function is 4, the classification accuracy is 93.68%, and the classification performance is stable without significant rises or falls. This conclusion was also verified in actual application.

Keywords: Bus overload · Image segmentation · Image feature extraction · Support vector machine · Image classification

1 Introduction

Bus overloading means that the number of passengers in a vehicle exceeds the authorized number of passengers. Overloading is a direct threat to passenger safety: once a traffic accident occurs, it leads to casualties and has a significant influence on society [1]. To prevent vehicles from overloading as much as possible, the public security, transportation, highway, and other departments take active measures. On the one hand, they actively publicize the dangers of overloading to enhance the safety awareness of passengers. On the other hand, they use various advanced technologies to supervise overloading, such as installing driving recorders, cameras, and other monitoring equipment in buses [2]. These measures not only reduce the waste of

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 28–43, 2018.
https://doi.org/10.1007/978-3-030-05057-3_3


manpower and material resources, but also make it possible to investigate and punish overloaded illegal vehicles based on evidence. At present, most provinces and cities in China still use manual recognition to classify the images photographed by cameras installed in buses in order to determine whether a bus is overloaded. Although the accuracy of manual identification is high, its efficiency is low, so it cannot meet current regulatory needs [3]. To solve this problem, an overload classification method for bus images based on image processing and support vector machine (SVM) is proposed. Compared with existing manual recognition methods, this method automatically recognizes overloaded bus images, which saves a great deal of human resources and greatly improves the speed and volume of illegal-image identification [4].

2 Pretreatment of Bus Images

The purpose of this paper is to classify bus images to detect overloaded buses using image processing and support vector machine. Image preprocessing is a precondition for image classification and plays an important role in classifying overloaded bus images. The experimental data are derived from the historical data of the transportation department of Jilin City, Jilin Province.

Fig. 1. First group of bus image enhancement effect graph and contrast histogram


2.1 Histogram Equalization

In this paper, the images were taken by cameras installed in buses and their quality is poor. Thus, image enhancement is necessary before image segmentation. We used histogram equalization to enhance the original images, which makes the gray-level distribution of the whole image tend to be uniform. The process is as follows:

Fig. 2. Second group of bus image enhancement effect graph

Calculate the number of pixels for each gray level of the original bus images, ni, i = 0, 1, …, L − 1, where L is the total number of gray levels of the image.

1. Calculate the original image histogram:

$$P(r_i) = \frac{n_i}{n} \qquad (1)$$

where n is the total number of pixels of the original image.

2. Calculate the cumulative distribution function:

$$s_k(r_k) \approx \sum_{i=0}^{k} P(r_i), \qquad k = 0, 1, \ldots, L-1 \qquad (2)$$

3. Calculate the output gray level:

$$g_k = \mathrm{INT}\left[(g_{\max} - g_{\min})\, s_k(r_k) + g_{\min} + 0.5\right], \qquad k = 0, 1, \ldots, L-1 \qquad (3)$$

where INT[·] is the rounding operator.


When gmin = 0 and gmax = L − 1, formula (3) can be written as:

$$g_k = \mathrm{INT}\left[(L-1)\, s_k(r_k) + 0.5\right] \qquad (4)$$

4. Obtain the output images by modifying the original images according to the mapping between the gray levels (rk) of the original images and the output gray levels (gk).
5. Apply image enhancement to the two groups of original bus images; the corresponding results are shown in Figs. 1 and 2.

2.2 Image Segmentation

In order to classify overloaded bus images using support vector machine, we need to separate the target area from the background area to obtain training data, so it is important to segment the target area out of the original image. Threshold segmentation is one of the earliest image segmentation methods, and it is simple and effective. It includes the maximum between-class variance (Otsu) method, the minimum cross-entropy threshold method, the maximum entropy threshold method, and the maximum correlation threshold method [5].
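The histogram-equalization procedure of Sect. 2.1, formulas (1)–(4), can be sketched in a few lines of NumPy. The function name `equalize` and the toy image are ours:

```python
import numpy as np

def equalize(img, L=256):
    """Histogram equalization following formulas (1), (2) and (4).

    img : 2-D uint8 array of gray levels in [0, L-1].
    Returns the equalized image with the same shape.
    """
    n = img.size
    hist = np.bincount(img.ravel(), minlength=L)   # n_i counts
    p = hist / n                                   # P(r_i), formula (1)
    s = np.cumsum(p)                               # s_k, formula (2)
    # formula (4) with g_min = 0, g_max = L-1; floor(x + 0.5) is INT[.]
    g = np.floor((L - 1) * s + 0.5).astype(np.uint8)
    return g[img]                                  # map each pixel r_k -> g_k

# tiny 4-level example: gray levels 0..3 spread toward the full range
img = np.arange(4, dtype=np.uint8).reshape(2, 2)
print(equalize(img, L=4))
```

The lookup table `g` is built once from the cumulative histogram and then applied to every pixel by fancy indexing, which is how formula (4) is typically realized in practice.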

Fig. 3. Comparison of image segmentation effects of four segmentation methods

Through analysis of the bus images, we regard the aisle in the image as the target area and the surrounding passengers as the background area. We then processed the same image using


the four traditional segmentation methods mentioned above. The corresponding results are shown in Fig. 3. As Fig. 3 shows, all four segmentation methods produce noise and holes, which greatly affects feature extraction. Therefore, in this paper we first process the bus images with threshold segmentation and then repeatedly apply the closed operation to remove noise and fill holes. The closed operation is a morphological transformation that combines the dilation and erosion algorithms [6]: dilation is first applied to the segmented images, and erosion is then applied to the result. We processed the images in Fig. 3 with three closed operations; the results are shown in Fig. 4.

Fig. 4. Effect graph using threshold segmentation and closed operation

As shown in Fig. 4, the traditional maximum-correlation threshold segmentation method gives the worst results, while the traditional Otsu method gives the best, effectively separating the target areas from the background areas apart from several connecting pixels.

Fig. 5. Gray histogram of graph (a) in Fig. 3

In this paper, we select the middle aisle of the bus images as the target for the training samples. Figure 3(a) is a normal bus image in which the aisle region accounts for one fifth


of the original image and the non-aisle region accounts for four fifths. Thus, compared with the background area, the target area is much smaller. The gray histogram of image (a) is shown in Fig. 5. As Fig. 5 shows, there are fewer pixels on the left, the gray distribution of the middle pixels is uniform, and the gray level at the far right almost reaches a peak. This means that the pixels of the target area are concentrated on the left, while the pixels of the background are concentrated in the middle and on the right; the gray scale of the background area is larger than that of the target area. Owing to the small variance of the target area and the large variance of the background area, the traditional Otsu method biases the threshold toward the large-variance area, which makes the calculated threshold larger than the ideal one and yields poor segmentation results. In order to improve the quality of image segmentation and the accuracy of identifying overloaded bus images, we modify the traditional Otsu method. The original Otsu criterion can be written as

$$\sigma(t) = \omega_0(\mu_0 - \mu)^2 + \omega_1(\mu_1 - \mu)^2 = \omega_0\,\omega_1\,(\mu_1 - \mu_0)^2 \qquad (5)$$

where ω0 is the probability of the target class and ω1 is the probability of the background class; that is, the target area and the background area are weighted [7]. In this paper, we adjust the weighting by lowering and raising the powers of ω0 and ω1. The improved Otsu criterion is

$$\sigma(t) = \omega_0^{\alpha}(\mu_0 - \mu)^2 + \omega_1^{\beta}(\mu_1 - \mu)^2 = \omega_0^{\alpha}\,\omega_1^{\beta}\,(\mu_1 - \mu_0)^2 \qquad (6)$$

Where α represents the proportion of the background area in the whole image, and β is the reciprocal of α, which removes the bias toward either class. By modifying the original formula, we ensure that the threshold is not pushed too high when the variance of one class is larger than that of the other, and at the same time the gray levels of the two classes are more balanced. The results of the traditional Otsu algorithm and the improved Otsu algorithm are shown in Fig. 6. As shown in Fig. 6, with the improved Otsu algorithm the passengers in the background area are no longer classified into the target area, and the target area is effectively separated from the background area. Therefore, in this paper, we use the improved Otsu algorithm together with the closed operation to segment bus images, which suppresses the effects of noise and holes and provides a good basis for feature extraction.

Fig. 6. Comparison between traditional Otsu algorithm and improved Otsu algorithm

T. Li et al.
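As a rough sketch (not the authors' code), the improved criterion of Eq. (6) can be implemented as an exhaustive threshold search over the gray histogram. The default α = 0.8 reflects the roughly four-fifths background proportion mentioned above and is our assumption:

```python
import numpy as np

def improved_otsu(gray, alpha=None):
    """Improved Otsu threshold: weight the class probabilities by the
    exponents alpha (background proportion) and beta = 1/alpha, using
    the product form of Eq. (6). `gray` is a 2-D uint8 array; returns
    the selected threshold."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    if alpha is None:
        alpha = 0.8  # assumed prior proportion of background pixels
    beta = 1.0 / alpha
    levels = np.arange(256)
    best_t, best_sigma = 0, -1.0
    for t in range(1, 255):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        sigma = (w0 ** alpha) * (w1 ** beta) * (mu1 - mu0) ** 2
        if sigma > best_sigma:
            best_sigma, best_t = sigma, t
    return best_t
```

Because the exponent α < 1 shrinks the target-class weight more gently than the background weight, the selected threshold is pulled down relative to plain Otsu, matching the motivation in the text.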

3 Bus Image Feature Extraction

After image enhancement and segmentation, we select the Kirsch operator to extract features from the segmented images and build an image feature database, which is used to classify bus images with a support vector machine. The Kirsch operator computes the convolution and derivative for each pixel using eight templates. The eight templates represent eight directions, giving the maximal response to the eight specific edge directions of the image. The output of the Kirsch operator is the maximum over the eight directions. Kirsch is an effective edge detection operator, which can significantly suppress noise during edge detection [8]. Assume a 3 × 3 sub-picture of the original image, centered at pixel (i, j) with neighbors a0–a7, as shown in Fig. 7:

a3    a2    a1
a4  (i,j)   a0
a5    a6    a7

Fig. 7. A 3 × 3 sub-picture of the original image

The gradient of the edge is:

G(i, j) = max{1, max(|5S_k − 3T_k| : k = 1, 2, …, 8)}   (7)

Where S_k = x_{k+1} + x_{k+2} + x_{k+3} and T_k = x_{k+4} + x_{k+5} + … + x_{k+8}, with k = 1 to 8 indexing the eight direction templates shown in Fig. 8. The Kirsch operator is based on the fact that the gray scale of non-edge pixels is smaller than the threshold while the gray scale of edge pixels is larger. When detecting image edges, we first use a lower threshold to binarize the original images and then detect the target and background areas, which can be effectively divided by the boundary regions whose gray scale exceeds the threshold [9]. Using the method described above, we preprocess two groups of original bus images and extract the corresponding features. The results are shown in Fig. 9.
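A minimal NumPy sketch of the eight-template Kirsch response (our own illustration, not the paper's code; the 5/−3 compass masks are the standard Kirsch templates):

```python
import numpy as np

# The eight Kirsch compass templates, one per direction; the operator's
# output at each pixel is the maximum absolute response over all eight.
KIRSCH_TEMPLATES = [
    np.array([[ 5,  5,  5], [-3,  0, -3], [-3, -3, -3]]),  # N
    np.array([[ 5,  5, -3], [ 5,  0, -3], [-3, -3, -3]]),  # NW
    np.array([[ 5, -3, -3], [ 5,  0, -3], [ 5, -3, -3]]),  # W
    np.array([[-3, -3, -3], [ 5,  0, -3], [ 5,  5, -3]]),  # SW
    np.array([[-3, -3, -3], [-3,  0, -3], [ 5,  5,  5]]),  # S
    np.array([[-3, -3, -3], [-3,  0,  5], [-3,  5,  5]]),  # SE
    np.array([[-3, -3,  5], [-3,  0,  5], [-3, -3,  5]]),  # E
    np.array([[-3,  5,  5], [-3,  0,  5], [-3, -3, -3]]),  # NE
]

def kirsch_edges(gray):
    """Return the maximum Kirsch response at each interior pixel."""
    gray = gray.astype(np.int32)
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.int32)
    for tpl in KIRSCH_TEMPLATES:
        resp = np.zeros_like(out)
        # manual 3x3 correlation on the interior (no padding)
        for di in range(3):
            for dj in range(3):
                resp[1:h-1, 1:w-1] += tpl[di, dj] * gray[di:h-2+di, dj:w-2+dj]
        out = np.maximum(out, np.abs(resp))
    return out
```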


Fig. 8. The eight direction templates

Fig. 9. Two sets of original bus images and their feature extraction

Fig. 10. Aisle shape extraction effect for unloaded and overloaded images

For the classification of bus images, we are only concerned with the target area information. In order to reduce computation and improve the accuracy of classification, we need to avoid the influence of the non-target area after extracting image outlines. In this paper, we simply process the extracted outline images and then extract the shape of the aisle position as sample data for the image feature database. Figure 10 shows the extracted aisle shapes. In this paper, we process 551 bus images. Some results are shown in Fig. 11.

Fig. 11. Part of the bus image feature samples

4 Image Classiﬁcation Based on Support Vector Machine

Analyzing the image features of the target area, we find that the features of a target area whose aisle contains passengers differ significantly from those of a target area whose aisle is empty. Therefore, we can recognize overloaded bus images by using the shapes of the feature images of the target area. We divide the training data into two parts, a positive training set and a negative training set, where the positive training set stores outline feature samples from non-overloaded bus images and the negative training set stores outline feature samples from overloaded bus images. After constructing the two training sets, we can use a support vector machine to classify bus images. The support vector machine is very effective for linear classification problems [10]. A nonlinear classification problem can be transformed into a linear one by a nonlinear transformation function, which makes it linearly separable in a high-dimensional space [11]. For a nonlinear classification problem, solving for the optimal classification surface is equivalent to the following problem:

Minimize

φ(w, ξ) = (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i   (8)

Subject to

y_i[(wᵀx_i) + b] − 1 + ξ_i ≥ 0,  ξ_i ≥ 0,  i = 1, 2, …, n   (9)

Where C > 0 is a penalty coefficient. This is a quadratic programming problem that can be solved by the Lagrange method and translated into the following dual problem:

Maximize

Q(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (x_i ⋅ x_j)   (10)

Subject to

Σ_{i=1}^{n} y_i α_i = 0   (11)

0 ≤ α_i ≤ C,  i = 1, 2, …, n   (12)

The weight coefficient for the optimal classification surface is:

w = Σ_{i=1}^{n} α_i y_i x_i   (13)

It can be seen that the weight coefficient of the optimal classification surface is a linear combination of the training samples. From formula (12), the smaller the penalty coefficient C is, the smaller the Lagrange multipliers are. Likewise, from formula (13), the smaller α_i is, the smaller ‖w‖ is, which means a larger interval between the two classes and thus better generalization performance of the SVM. In short, the smaller C is, the larger the interval between the two classes and the better the generalization performance, at the cost of reduced accuracy; conversely, the larger C is, the smaller the interval and the poorer the generalization performance, but the higher the accuracy. Therefore, the penalty coefficient affects both the generalization performance and the accuracy of the SVM, and its value should be selected appropriately. In this paper, we classify bus images with an SVM using an appropriately chosen kernel function. The type of kernel function significantly affects the performance of the SVM. Three common kernel functions are used in this paper: the polynomial kernel function, the Gaussian kernel function, and the sigmoid kernel function [12]. They can be written as follows. Polynomial kernel function:

K(x_i, x_j) = [(x_i ⋅ x_j) + 1]^d   (14)

Where d is the degree of the polynomial. Gaussian kernel function:

K(x_i, x_j) = exp(−σ‖x_i − x_j‖²)   (15)

Sigmoid kernel function:

K(x_i, x_j) = tanh(σ(x_i ⋅ x_j) + c)   (16)

Among them, the polynomial kernel function depends on d, and the Gaussian kernel function depends on σ. In this paper, we compare the accuracy of the SVM model with different kernel functions on a large number of test images and finally choose the optimal classifier. We classify overloaded bus images based on image processing and a support vector machine. Firstly, we select training samples from the standard sample library and preprocess the selected images, including histogram equalization, image segmentation, and the closed operation. Secondly, we extract the edge features of the preprocessed images and build a training set of feature samples. Then, we select an appropriate kernel function and parameters and train a support vector machine model on the training set. Finally, we use the trained model to predict the class labels of the testing set and calculate the accuracy of the model. The whole flow chart is shown in Fig. 12.

Fig. 12. Bus image overload classification based on image processing and SVM
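As an illustration only (the paper gives no code), the kernel choices above map directly onto scikit-learn's SVC; the helper name and defaults below are ours, with degree d = 4 and C = 100 taken from the experiments in Sect. 5:

```python
from sklearn.svm import SVC

def train_overload_classifier(features, labels, kernel="poly", d=4, C=100):
    """Train an SVM on aisle-outline feature vectors.

    kernel="poly" with degree d = 4 and C = 100 mirrors the
    best-performing configuration reported in the experiments; gamma
    plays the role of the paper's sigma for the RBF and sigmoid kernels,
    and coef0 plays the role of c in the sigmoid kernel."""
    if kernel == "poly":
        clf = SVC(kernel="poly", degree=d, coef0=1, C=C)
    elif kernel == "rbf":
        clf = SVC(kernel="rbf", gamma=0.5, C=C)
    else:
        clf = SVC(kernel="sigmoid", gamma=1.0, coef0=-0.3, C=C)
    clf.fit(features, labels)
    return clf
```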

5 Experiments and Results

The purpose of this paper is to divide bus images into non-overloaded and overloaded images based on image processing and a support vector machine. It is difficult to determine which type of kernel function is best when the feature mapping is unknown; the performance of the model is therefore significantly related to the choice of kernel function. At present, many researchers make this choice based on the generalization error of the classifier over a great number of experiments [13]. In this paper, 897 whole bus images are used as the sample database, including 36 obstructed images and 861 normal images. In order to analyze the experimental results, the 861 normal images are selected as the standard dataset. The dataset consists of two types of images, of which 669 are non-overloaded and 192 are overloaded. The resolution of each image is 352 × 288. We divide the dataset into a training set and a testing set by the hold-out method [14]. The hold-out method is a popular sampling method in which dataset D is divided into two mutually exclusive sets, a training set S and a testing set T. After training a model on the training set S, the testing set T is used to calculate the testing error, which estimates the generalization error of the model. In this paper, 426 non-overloaded images and 125 overloaded images are selected randomly from the standard dataset of 861 images as the training set, and the remaining 310 images (243 non-overloaded and 67 overloaded) form the testing set, which ensures that the testing samples are not used in the training process. We select precision, the proportion of correctly classified samples in the testing set, as the evaluation indicator. Each experiment is repeated with 5 random splits, and the evaluation result is the mean over the five runs, reported to two decimal places. The purpose of this experiment is to observe the classification accuracy of the classifier under different parameters of different kernel functions, and to select the kernel function and parameters best suited to this task. For the polynomial kernel function, d takes the values 1, 2, 3, 4, and 5. For the Gaussian kernel function, σ takes the values 0.1, 0.5, 1, 2, and 5. For the sigmoid kernel function, with σ = 1, c takes the values 0.5, 0.3, 0.1, 0, −0.1, −0.3, and −0.5. Meanwhile, according to the literature [15], the penalty factor C is set to 100. The classification accuracies of the three kernel functions with different parameters are given below (Table 1).

Table 1. Classification accuracy (%) of polynomial kernel function with different parameters

Parameter d   Group 1   Group 2   Group 3   Group 4   Group 5   Mean (%)
1             80.65     80.00     81.61     80.97     82.26     81.10
2             84.84     85.48     87.10     85.16     86.45     85.81
3             89.68     91.61     90.32     90.97     88.71     90.26
4             93.23     93.55     94.19     93.87     93.55     93.68
5             89.68     89.68     90.00     89.68     89.03     89.61
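The hold-out protocol described above (551 training images, 310 testing images, repeated over 5 random splits) can be sketched as follows; the function name and class-stratified sampling are our own framing:

```python
import numpy as np

def holdout_splits(n_neg=669, n_pos=192, n_neg_train=426, n_pos_train=125,
                   repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for the hold-out protocol:
    the dataset is split into mutually exclusive train/test sets,
    repeated `repeats` times with fresh random permutations. The class
    sizes follow the paper's 861-image standard dataset."""
    rng = np.random.default_rng(seed)
    labels = np.array([0] * n_neg + [1] * n_pos)
    for _ in range(repeats):
        neg = rng.permutation(np.where(labels == 0)[0])
        pos = rng.permutation(np.where(labels == 1)[0])
        train = np.concatenate([neg[:n_neg_train], pos[:n_pos_train]])
        test = np.concatenate([neg[n_neg_train:], pos[n_pos_train:]])
        yield train, test
```

The final reported accuracy is then the mean of the per-split accuracies, matching the "mean of five times" evaluation in the text.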

Figure 13 shows the average classification accuracy of the polynomial kernel function with different parameters.

Fig. 13. Mean accuracy curve of polynomial kernel function with different parameters

As can be seen from the trend of the curve in Fig. 13, the classification accuracy of the polynomial kernel function varies with its parameter. With d increasing, the classification accuracy of the model first increases and then decreases. When d is 4, the classification effect is the best, reaching 93.68%. Over the experimentally selected values of d, the average classification accuracy fluctuates within the limited range of 81.10%–93.68%, so the performance of the model is relatively stable (Table 2).

Table 2. Classification accuracy (%) of RBF kernel function with different parameters

Parameter σ   Group 1   Group 2   Group 3   Group 4   Group 5   Mean (%)
0.1           67.74     68.39     70.32     69.35     67.42     68.64
0.5           90.32     89.68     88.71     90.32     90.97     90.00
1             85.16     83.87     84.52     84.52     83.23     84.26
2             74.19     75.81     74.84     75.81     75.16     75.16
5             70.97     70.00     69.35     71.61     70.00     70.39

The average classification accuracy curve of the Gaussian kernel function with different parameters is shown in Fig. 14.

Fig. 14. Mean accuracy curve of RBF kernel function under different parameters

Fig. 15. Mean accuracy curve of Sigmoid kernel function under different parameters


It can be seen from Fig. 14 that the Gaussian kernel function has different classification accuracy with different parameters. Within the selected range, when σ is 0.1 the classification effect is poor; when σ is 0.5 the classification effect is the best; then, as σ increases further, the classification accuracy drops and is not very stable. The average accuracy curve of the sigmoid kernel under different parameters is shown in Fig. 15. Analyzing the experimental data of Table 3 and the average precision curve of Fig. 15, the classification accuracy of the sigmoid kernel function fluctuates in the range of 72.32%–88.77%. When c is −0.3, the classification accuracy is the best. Moreover, when c takes a negative value, the classification accuracy is better than for positive values, which accords with the analysis of the sigmoid kernel in ref. [16].

Table 3. Classification accuracy (%) of Sigmoid kernel function under different parameters

Parameter c   Group 1   Group 2   Group 3   Group 4   Group 5   Mean (%)
−0.5          80.65     82.26     81.61     83.87     80.65     81.81
−0.3          88.06     87.10     89.68     90.32     88.71     88.77
−0.1          84.52     85.16     85.48     85.16     83.87     84.84
0             78.06     79.03     78.71     80.00     78.71     78.90
0.1           76.13     75.81     75.81     77.42     76.77     76.39
0.3           80.65     80.65     81.29     80.00     80.97     80.71
0.5           71.94     70.97     72.58     72.58     73.55     72.32

Comprehensively analyzing the three kernel functions selected in this paper, the polynomial kernel function clearly outperforms the other two. For the Gaussian kernel function, the classification accuracy reaches 90.00% only when σ is 0.5; for other values of σ the classification effect is unstable. The classification performance of the sigmoid kernel function is also unstable and exhibits oscillation. Among the three, the polynomial kernel function has the highest average classification accuracy, up to 93.68%, and its classification performance is relatively stable. In general, the polynomial kernel function with parameter d = 4 is the best choice for bus image overload classification in this paper, with the caveat that this holds only within the limited range of kernel functions and parameters examined. It can be seen from the above experiments that the average success rate of bus overload classification using the image classification method based on support vector machines reaches 93.68%. When applied to the traffic visualization system in Jilin Province, the accuracy rate still reaches about 93%. Therefore, image processing and support vector machine techniques can achieve bus overload detection.

6 Conclusion

In this paper, images of passengers photographed inside the bus are preprocessed with image enhancement, improved threshold segmentation, and the closed operation; feature extraction is then performed on these preprocessed samples to establish a training set, an appropriate kernel function and parameters are selected, and an SVM model is trained on the training set. The automatic classification of imported images is finally completed, and overloaded images are intelligently identified. For the images obtained in this paper, comparative analysis over multiple sets of experiments shows that the classification accuracy is highest when the polynomial kernel parameter d is 4. Increasing the recognition speed and efficiency for overloaded bus images can save considerable human resources and increase penalty rates for violations, so the bus image overload classification method based on image processing and support vector machines has great value. However, there is still a gap to the ideal classification accuracy of 100%; further improving the classification accuracy is future work.

References

1. Ding, C.: The effect of overloaded cars and the tire pressure on the stress distribution of the road. Int. J. Intell. Inf. Manag. Sci. 5(3), 264–267 (2016)
2. Wang, W.L., Lu, C.Z., Li, Y.R.: Basic economic measures in long-term effective mechanism for administering overload and oversize of motor vehicles. Int. J. Intell. Inf. Manag. Sci. 24(6), 148–152 (2007)
3. Zhang, Z., Cheng, W., Wu, L., et al.: Study on circular traffic signs recognition method based on invariant moments and SVM. J. Electron. Meas. Instrum. 31(5), 773–779 (2017)
4. Zhao, G.Q., Wang, F.J.: Car train overload signal monitoring system optimization modeling research. Comput. Simul. 33(11), 162–163 (2016)
5. Wu, Y.Q., Meng, T.L., Wu, S.H.: Research progress of image thresholding methods in recent 20 years (1994–2014). J. Data Acquis. Process. 30(1), 1–23 (2015)
6. Yan, J.Z., Lin, S., Sing, B.K.: Change-based image cropping with exclusion and compositional features. Int. J. Comput. Vis. 114(1), 74–87 (2015)
7. Venmathi, A.R., et al.: Kirsch compass kernel edge detection algorithm for micro calcification clusters in mammograms. Middle East J. Sci. Res. 24(4), 1530–1535 (2016)
8. Liu, D.H., Zhang, Y.D., Li, X., et al.: Adaptive thresholding method under the dynamic environment. J. Comput. Appl. 36(S2), 152–156 (2016)
9. Venmathi, A.R., Venmathi, E.N., Ganesh, N.K.: Kirsch compass kernel edge detection algorithm for micro calcification clusters in mammograms. Middle East J. Sci. Res. 24(4), 1530–1535 (2016)
10. Thang, P.Q., Thuy, N.T., Lam, H.T.: A modification of solution optimization in support vector machine simplification for classification. In: Bhateja, V., Nguyen, B.L., Nguyen, N.G., Satapathy, S.C., Le, D.-N. (eds.) Information Systems Design and Intelligent Applications. AISC, vol. 672, pp. 149–158. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7512-4_15


11. Zhi, J., Sun, J., Wang, Z., Ding, W.: Support vector machine classifier for prediction of the metastasis of colorectal cancer. Int. J. Mol. Med. 41(3), 1419–1426 (2018)
12. Mcdonald, G., Macdonald, C., Ounis, I.: A study of SVM kernel functions for sensitivity classification ensembles with POS sequences. In: SIGIR 2017, pp. 1097–1100 (2017)
13. Yang, L., Wang, Y.: Survey for various cross-validation estimators of generalization error. Appl. Res. Comput. 32(5), 1287–1290 (2011)
14. Zhou, Z.H.: Machine Learning, 2nd edn. Tsinghua University Press, Beijing (2016)
15. Yu, Z., Wong, H.S., Wen, G.: A modified support vector machine and its application to image segmentation. Image Vis. 29(1), 29–40 (2016)
16. Lin, H.-T., Lin, C.-J.: A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Submitted to Neural Comput. 27(1), 15–23 (2003)

Accurate Acoustic Based Gesture Classification with Zero Start-Up Cost

Haojun Ai1,2,3, Liangliang Han4, Yifeng Wang1(B), and Liang Liao5,6

1 School of Cyber Science and Engineering, Wuhan University, Wuhan, Hubei, China {aihj,whuyifeng}@whu.edu.cn
2 Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Beijing, China
3 Collaborative Innovation Center of Geospatial Technology, Wuhan, China
4 Aerospace System Engineering Shanghai, Shanghai, People's Republic of China
5 ChangZhou Municipal Public Security Bureau, Changzhou, China
6 Key Laboratory of Police Geographic Information Technology, Ministry of Public Security, Beijing, China

Abstract. Acoustic gesture recognition based on the Doppler effect has garnered much research attention. The accuracy of gesture recognition and potential false positives are the main factors that limit the widespread use of gestures. To this end, we propose a novel gesture classification method based on the acoustic Doppler effect that does not require any custom hardware, only a speaker and one microphone on a laptop. An effective sound field is built by a high-frequency sound wave from the speaker, and the wave reflected by hand motion is captured by the microphone. We design a set of five features. Three of them are stable and invariant across different people, so even new users can operate our system with zero start-up cost and no training. The remaining two features are highly correlated with the velocity of a gesture and its distance from the computer, which can reduce potential false positives in detection. Besides, a classifier based on multistage decision rules is designed to identify the 11 kinds of defined gestures. Experimental results on user experience feedback show that our system has good usability. Numerical experiments with 10 users show that our system not only keeps potential false positives low, but also achieves a classification accuracy of up to 99.09%.

Keywords: Doppler effect · Gesture classification · Acoustic · HCI

1 Introduction

For years, device-free gesture recognition [2,9,10,17] has developed rapidly. In particular, the widely used mobile phones and PCs already have audio input and output components composed of speakers and microphones, so Doppler-based gesture recognition as a new human-machine interface has attracted the attention of researchers [1,7,8,14,15].

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 44–58, 2018. https://doi.org/10.1007/978-3-030-05057-3_4


Many studies have tried to use machine learning methods [11,13,19–21] to improve the accuracy of gesture recognition. For example, Ai et al. [1] obtained an HMM model of each gesture by training on the feature vectors of the samples, finally achieving a recognition accuracy of 95% for 18 gestures. Dolphin [16] extracted the effective frequency bins around the peak and normalized them to form a feature vector; the classifier they chose was Liblinear (a large linear classifier), with 93% recognition accuracy. In addition, classifiers such as neural networks [11] and Bayes [19] were also used in some research. Although adopting machine learning significantly improves the classification accuracy of gestures, it causes other problems such as increased computational complexity and time consumption. Besides, potential false positives in gesture detection are also a key issue that restricts the widespread use of gestures in HCI. Most acoustic-based hand gesture classification methods show good robustness in an unmanned environment [1,14,16], but when people walk around, they are prone to false positives in detection [1,7,16]. In this paper, we extract three generally stable, invariant features to characterize a gesture and two other features that reduce false positives in detecting gestures. Furthermore, we design a classifier based on multistage decision rules to categorize the 11 predefined gestures with high accuracy and few false positives. We summarize the main contributions of the paper as follows:

– We extract five features from the Doppler shift to characterize a gesture, and design a classifier based on multistage decision rules to identify all gestures, which keeps a high precision during gesture recognition.
– Two of the features are the bandwidth and the amplitude of the shift, which significantly reflect the velocity of a gesture and its distance from the computer. Threshold settings on them can effectively identify far-range walking and slow motions of other people, thereby reducing potential false positives in detection.
– The remaining three features are the direction, the count of direction changes, and the distance. They are all stable and invariant properties of a gesture, so they generally do not change when the same gesture is performed by different people. Hence users can operate our system with zero start-up cost and no training.

2 Feature Extraction

The theoretical basis of the gesture identification is the well-known Doppler effect [18]. When a moving object approaches the signal source (the speaker), the frequency of the signal perceived by the receiver (the microphone) becomes larger [3], whereas the perceived frequency decreases when the object moves away from the wave source.


H. Ai et al.

The Doppler shift f_R caused by a movement can be calculated by the equations:

f_R = (1 + Δv/c) × f   (1)

Δf = f_R − f_S   (2)

Where Δv and c respectively represent the velocity of the object and of sound in air, and f_S is the pilot tone transmitted from the speaker. Since the speaker and microphone stay stationary and are located on the same laptop, the velocities of the receiver and the source need not be considered.

2.1 Signal Analysis

In this paper, an effective sound field is formed by a high-frequency 18 kHz signal from the speaker. When the operator moves a hand in it, the reflected frequency shift is captured by the microphone. According to the characteristics of the Doppler frequency shift [6], all signal processing is carried out in the frequency domain. We set the sampling frequency of the microphone to 44.1 kHz, and a 2048-point FFT is performed to obtain the frequency-domain characteristics of the sound. In the informal test of SoundWave [8], the fastest gesture reached 3.9 m/s. Herein, we conservatively estimate the fastest speed as 6 m/s; that is, the maximum frequency shift Δf_max = 318 Hz is calculated according to Eq. (1), so the left effective frequency range of the emitted peak is [17682, 18000] Hz and the right effective range is [18000, 18318] Hz.
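Under the stated assumptions (c ≈ 340 m/s, which reproduces the paper's rounded 318 Hz figure), the effective band and its FFT bin indices can be computed as follows (the function name is ours):

```python
FS = 44100          # microphone sampling rate (Hz)
NFFT = 2048         # FFT size
F_TONE = 18000.0    # pilot tone (Hz)
C = 340.0           # assumed speed of sound in air (m/s)
V_MAX = 6.0         # assumed fastest hand speed (m/s)

def effective_band():
    """Return (f_low, f_high) of the effective band around the pilot
    tone, and the corresponding FFT bin index range."""
    df_max = V_MAX / C * F_TONE          # maximum Doppler shift, ~318 Hz
    f_low, f_high = F_TONE - df_max, F_TONE + df_max
    bin_hz = FS / NFFT                   # frequency resolution, ~21.5 Hz/bin
    return (f_low, f_high), (int(f_low / bin_hz), int(f_high / bin_hz) + 1)
```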

Fig. 1. (a) Positive shift in frequency spectrum generated by a towards-gesture. (b) Time-frequency map caused by a moving hand. The hand moves towards and away from the device alternately from the 4th to the 8th second, with no motion in the remaining time.

We set the length of the analysis windows to 50 ms, so the frequency domain is refreshed every 50 ms. The frequency spectrum is like a micro-image, reflecting the instantaneous changes in the frequency of gestures and containing many tiny details (Fig. 1(a)). A time-frequency graph is generated by adding time information to the spectrum; as seen in Fig. 1(b), it expresses the direction and distance of gestures at the macro level.

2.2 Feature Extraction

After getting the spectrum of the signal collected by the microphone, we extract five features: the bandwidth, the amplitude of the frequency shift, the direction, the count of direction changes, and the moving distance of a gesture, so as to form a feature vector x:

x = (x(1), x(2), …, x(i), …, x(n))ᵀ   (3)

Where x(i) represents the ith feature, and n = 5. The overall flow is shown in Fig. 2. Next, we explain each feature of the frequency shift in detail.

Bandwidth (x(1)). x(1) is the bandwidth of the emitted peak, obtained by scanning the frequency bins at 30% of the tone amplitude, extracted with the same method as SoundWave [8]. x(1) is a measure of the absolute velocity of the gesture movement and divides the hand velocity into different levels (Fig. 3). By setting an appropriate threshold θv, false positives caused by unintended slow motions of users can be effectively detected.

Fig. 2. The processing flow of sound signal: capture ultrasound → Hamming window → FFT → extract bandwidth → determine direction → calculate distance.

Fig. 3. Bandwidth in frequency spectrum caused by different velocity gestures.
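The window-and-FFT front end of Fig. 2 can be sketched as follows; a 2048-sample Hamming-windowed frame (≈46 ms at 44.1 kHz) approximates the paper's 50 ms analysis window, and the function name is ours:

```python
import numpy as np

FS = 44100    # sampling rate (Hz)
NFFT = 2048   # FFT size; 2048 samples ≈ 46 ms, refreshed every 50 ms hop

def frame_spectrum(samples, start=0):
    """Hamming-window one analysis frame and return its magnitude
    spectrum together with the bin spacing in Hz."""
    frame = samples[start:start + NFFT] * np.hamming(NFFT)
    spec = np.abs(np.fft.rfft(frame))
    return spec, FS / NFFT
```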


Amplitude of Frequency Shift (x(2)). x(2) is the highest amplitude that the frequency shift reaches, expressed as a percentage relative to the amplitude of the tone peak A_peak. Shifts caused by performing the same gesture far from and near to the device are significantly different, mainly manifested in x(2), as illustrated in Fig. 4: the farther away a gesture is performed, the lower x(2) is. Therefore, by setting a higher amplitude threshold h_upper, gestures can basically be divided into two categories: near-range gestures G_near and far-range gestures G_far. In this paper, we set h_upper = 70% × A_peak; clearly x(2) > h_upper in the frequency spectrum of G_near, but x(2) < h_upper for G_far. To summarize, x(1) is the bandwidth covered by the frequency shift on the horizontal axis at a specific amplitude, which reflects the gesture velocity, while x(2) is the amplitude the frequency shift reaches on the vertical axis, so gestures can be simply divided into two categories based on their location relative to the computer. Identifying slow or far-range motion as a false alarm improves system robustness.

Fig. 4. Where L is the distance from the location of the gesture to the computer and V represents the velocity level of a gesture; the noise in the surrounding environment is about 45 dB. (a) No gesture performed. (b–c) High-amplitude shift caused by a gesture performed in near range, but the bandwidth x(1) in (c) is much larger than that in (b). (d–e) Lower-amplitude shift caused by a fast gesture in far range from the computer.

Direction (x(3)). x(3) represents the direction of the gesture, which depends on the energy difference between the right and left sides of the peak. When the frequency shift is positive, the energy on the right of the peak increases, whereas a negative shift causes the energy on the left side to increase.

Fig. 5. (a) The red line area shows a positive shift occurring on the right of the pilot peak, x(3) > 0, meaning a towards-gesture. (b) No movement and no frequency shift; x(3) is near zero. (Color figure online)

Define the energy on the left, E_left, as the integral of the spectrum within the effective range:

E_left = ∫ from f_S − Δf_max to f_S of f(x) dx   (4)

Similarly, define the right energy E_right:

E_right = ∫ from f_S to f_S + Δf_max of f(x) dx   (5)

Therefore, the difference between the right and left energy gives x(3):

x(3) = E_right − E_left   (6)


Where Δf_max = 318 Hz and f(x) is the amplitude of the shift at each effective frequency bin. As illustrated in Fig. 5(a), if x(3) is positive, the hand moves towards the device; a negative value means it moves away. No movement occurred if x(3) is near zero (Fig. 5(b)).
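A small sketch of the direction feature x(3) of Eqs. (4)–(6), summing the spectrum magnitude over the effective bands on each side of the pilot peak (the function name and bin discretization are our own):

```python
import numpy as np

def direction_feature(spec, bin_hz, f_tone=18000.0, df_max=318.0):
    """x3 = E_right - E_left: energy difference between the effective
    bands on either side of the pilot peak (Eqs. 4-6). `spec` is a
    magnitude spectrum, `bin_hz` its frequency resolution."""
    peak = int(round(f_tone / bin_hz))
    width = int(round(df_max / bin_hz))
    e_left = spec[peak - width:peak].sum()
    e_right = spec[peak + 1:peak + 1 + width].sum()
    return e_right - e_left
```

A positive return value indicates a towards-gesture, a negative one an away-gesture, and a value near zero no movement.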

Fig. 6. When the frequency shift property goes from ① to ②, one time change of hand direction is detected. Similarly, going from ② to ③ is also one time change.

Fig. 7. The area ② of a long distance gesture is obviously larger than the area ① of a short gesture.

Count of Direction Change (x(4)). When the sign of x(3) flips during a gesture, one change of gesture direction is recorded. In Fig. 6, the number of changes of motion direction x(4) is 5; that is, each crossing of the frequency shift through the peak intersection marks a change.

Distance (x(5)). x(5) is calculated by integrating the frequency shift over time, which indicates the moving distance of a gesture within one direction change and distinguishes long from short distance gestures (Fig. 7). Distance = time × velocity; the time information can be quickly obtained from the time-frequency map, so the key is the velocity. There is a proportional relationship between velocity and frequency shift based on Eq. (2). We use the following equation to make a rough calculation of x(5):

x(5) = (c / f_S) × ∫ from t1 to t2 of Δf dt   (7)

Where t1 and t2 respectively represent the start point and the next direction-change point of the gesture within one change of gesture direction. In Fig. 8, an informal test shows the x(5) distribution of different gestures, where the short and long distance gestures were each performed 100 times by 10 participants. The result verified our expectation that long and short distance gestures have a clear boundary value. So we initially set the threshold D_L/S between long and short distance to 500, making the two gesture types clearly distinct and ensuring a high sensitivity in distinguishing them.
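Equation (7) discretized over 50 ms analysis frames might look as follows. This is our sketch: the absolute value assumes the shift keeps one sign within a direction-change interval, and the c/f_S scaling yields metres (the axis of the paper's Fig. 8 is labelled in Hz, which suggests the authors use a differently normalized sum):

```python
import numpy as np

def distance_feature(shifts_hz, frame_s=0.05, f_tone=18000.0, c=340.0):
    """x5: integrate the per-frame Doppler shift over one
    direction-change interval (Eq. 7). `shifts_hz` holds the measured
    shift of consecutive 50 ms frames; the result is a rough moving
    distance in metres under this scaling."""
    return c / f_tone * np.sum(np.abs(shifts_hz)) * frame_s
```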

Accurate Acoustic Based Gesture Classiﬁcation with Zero Start-Up Cost

Fig. 8. Histogram of the x(5) distribution (x-axis: distance x(5) in Hz; y-axis: quantity of gestures) for short and long distance gestures.

Fig. 9. G1 ∼ G5 graphic representation (long distance: Towards, Away; short distance: Tap-Towards, Tap-Away; Both-Handed-Seesaw).

3 Gesture Definition and Classification

The designed gestures are not only accessible body language for HCI, but are also easily discriminated from each other.

3.1 Gesture Definition

Based on the proposed five features, we define a simple set of 11 gesture actions: G = {G1, G2, ..., Gj, ..., GN}, where N = 11 and Gj represents the j-th gesture in the set. All gesture descriptions are listed in Table 1, and Fig. 9 shows the motion graphics of G1 ∼ G5: G1 and G2 are long distance gestures, while the tap gestures resemble a mouse click and are therefore short distance motions. The remaining gestures G6 ∼ G11 are compound gestures. Users need to perform gestures at a certain velocity; a constant velocity is not required, only that the instantaneous velocity reaches a certain threshold. Users can adjust the velocity threshold according to their own habits.

3.2 Hand Gesture Classification

In this section, we classify gestures step by step based on different features until each gesture is categorized. The system first detects G5 (BHS), because it causes significant shifts on both sides of the tone peak simultaneously, a clear distinction from the remaining 10 gestures. Then, we classify the remaining 10 gestures using a classifier designed according to multistage decision rules (Fig. 10). Table 2 lists the feature values of the 10 gestures.

Table 1. Definition of gestures

G1   T    Towards: Move hand towards the microphone for long distance
G2   A    Away: Move hand away from the devices for long distance
G3   TT   Tap-Towards: Swipe hand towards then away, just like clicking a mouse once, short and quick
G4   TA   Tap-Away: Same action as G3, in the opposite direction
G5   BHS  Both-Handed-Seesaw: Move both hands from the two sides to the middle simultaneously, then separate
G6   TtA  Towards-then-Away: Swipe hand towards for long distance, then away back to the origin
G7   AtT  Away-then-Towards: Same gesture as G6, only in the opposite direction
G8   DTT  Double-Tap-Towards: Do G3 twice
G9   DTA  Double-Tap-Away: Perform G4 twice
G10  TTT  Triple-Tap-Towards: Perform G3 three times
G11  TTA  Triple-Tap-Away: Do G4 three times

4 Evaluation and Results

We evaluated the system performance experimentally. The system was developed on a laptop PC running Windows 10, with an ordinary microphone and speaker pair and no customized hardware (Fig. 11), so the direction of any gesture performed by the user is the same relative to the microphone and the speaker. Note that any gesture within 0.8 m, as well as people walking within 2 m of the computer, can cause significant frequency shifts; the experimental scene had a noise level of 45 dB.

4.1 Numerical Experiment

We conducted a numerical experiment to evaluate the robustness of the system in terms of the following three metrics: false positives, false negatives, and classification accuracy.


Fig. 10. Identifying gestures with the classifier: when a gesture is detected, the classifier applies the features x(4), x(3), and x(5) in turn as the decision rule at each stage.

Fig. 11. Device deployment in the experiment environment.

Potential False Positives. A false alarm refers to a gesture being erroneously detected when no gesture was executed. Experiments were conducted in two common living environments. In the first, the user only sat in front of the computer for normal typing and thinking motions while no one walked around. In half an hour, 6 potential false positives occurred, all of them single tap actions, since these gestures are short and simple. In the second case, the user performed no actions; three participants were located about 1.5 m from the computer and walked around for half an hour. The system finally detected 4 false positives, all of them the result of participants walking quickly.


Table 2. The list of features for all gestures

Gesture  x(3)     x(4)  x(5)
T        Towards  0     Long distance (L)
A        Away     0     L
TT       Towards  1     Short distance (S)
TA       Away     1     S
TtA      Towards  1     L
AtT      Away     1     L
DTT      Towards  3     S
DTA      Away     3     S
TTT      Towards  5     S
TTA      Away     5     S
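The feature table above maps directly onto a small rule-based classifier. The Python sketch below is ours, not the authors' code: the dispatch order (x(4), then x(3), then x(5)) follows the description of Fig. 10, and the threshold D_L/S = 500 follows Sect. 2; function and variable names are assumptions.

```python
D_LS = 500  # threshold separating short (S) and long (L) distance gestures

def classify(x3, x4, x5):
    """Multistage decision rules over the three features.
    x3: signed energy difference (>0 towards, <0 away),
    x4: count of direction changes,
    x5: moving distance feature (compared against D_LS)."""
    towards = x3 > 0
    long_dist = x5 >= D_LS
    if x4 == 0:                                   # one-way long gestures
        return "T" if towards else "A"
    if x4 == 1:
        if long_dist:                             # long there-and-back
            return "TtA" if towards else "AtT"
        return "TT" if towards else "TA"          # short taps
    if x4 == 3:                                   # double taps
        return "DTT" if towards else "DTA"
    if x4 == 5:                                   # triple taps
        return "TTT" if towards else "TTA"
    return None  # unrecognized gesture
```

G5 (BHS) is assumed to have been separated beforehand, as the text describes, since it shifts both sides of the tone peak simultaneously.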

False Negatives. A false negative means no gesture is detected although a deliberate gesture was actually performed. In our experiment, 10 users (marked U1 ∼ U10) each performed every gesture 100 times, resulting in 11000 (100 × 11 × 10) gesture samples. Several false negative errors occurred in the process, as shown in Fig. 12. The false negative rates of nine users were all less than 1%, but the rate of U2 was as high as 1.1% (Fig. 12). Inspecting the data, we found that U2 tends to move four fingers in parallel instead of sliding the whole palm, which results in a smaller frequency shift and may explain the higher false negative rate.

Fig. 12. The rate of false negative during the gesture sample test process.

                 Classified Gestures
Actual    T    A   TT   TA  BHS  TtA  AtT  DTT  DTA  TTT  TTA
T       100    0    0    0    0    0    0    0    0    0    0
A         0  100    0    0    0    0    0    0    0    0    0
TT        0    0   99    0    0    1    0    0    0    0    0
TA        0    0    0   98    0    0    2    0    0    0    0
BHS       0    0    0    0  100    0    0    0    0    0    0
TtA       0    0    3    0    0   97    0    0    0    0    0
AtT       0    0    0    4    0    0   96    0    0    0    0
DTT       0    0    0    0    0    0    0  100    0    0    0
DTA       0    0    0    0    0    0    0    0  100    0    0
TTT       0    0    0    0    0    0    0    0    0  100    0
TTA       0    0    0    0    0    0    0    0    0    0  100

Overall recognition rate = 99.09%

Fig. 13. The confusion matrix of the gesture classification.


Classification Accuracy. The effective gestures successfully detected in the above experiment were then used to measure classification precision. Since the samples were all labeled, we could easily calculate the final classification accuracy (Fig. 13), which reaches 99.09%. Several samples were misidentified, mostly because of occasional confusion in deciding between long and short distance gestures: different people have their own preferences when performing hand gestures, so it is very difficult to classify gestures with 100% accuracy using a single long/short threshold D_L/S. However, this does not contradict our claim, because the experimental results show that our method already identifies gestures of different distances with very high accuracy.

4.2 Gesture Usability Test

Research on gesture usability focuses on five main principles [4,5,12]: learnability, efficiency, memorability, errors, and coverage. Among them, the low error rate (99.09% accuracy) and the coverage (zero start-up cost and no training) were verified in Sect. 4.1. Next, we mapped the gesture set to a common remote controller, taking the MI smart TV remote controller as an example (Fig. 14). Each gesture operates one button; there are 11 buttons on the controller, corresponding to our 11 gestures. 10 users (U1 ∼ U10) performed gestures to simulate the remote controller and operate the MI TV freely. We collected a total of 151 gesture samples from the 10 users, of which 2 were missed and 1 was misidentified. We further recorded the user experience to evaluate the usability of the system for gesture classification. Each participant indicated that the system is particularly efficient, as they could smoothly operate the TV with high precision. Six participants remarked

Fig. 14. MI TV remote controller.


specifically on the learnability, since they were only asked to observe the demo and learn the gestures for 2–3 min before operating the TV. Besides, eight participants praised the memorability and learnability of the gestures: since the meaning of the gestures is easy to understand, they could remember them (and perform them) easily. However, two participants noted that some gesture actions are not very relevant to the corresponding menu functions, which increases the memory burden.

Finally, our method outperforms the state of the art on many items (Table 3). A computer with one speaker and one microphone meets all our hardware requirements. In addition, none of the experiments require users to provide gesture samples in advance, and no training is needed. Meanwhile, the results of the numerical experiments verify that our system is robust: it not only has fewer potential false positives, but also keeps the false negative rate within 1%, and finally achieves about 99% classification accuracy on the defined 11 gestures.

Table 3. Comparison to the existing sound-based methods

Method                     SoundWave [8]  Dolphin [16]  Multiwave [15]  Our method
Number of speakers         1              1             ≥2              1
Needs training?            NO             YES           YES             NO
Improves false positives?  YES            NO            NO              YES (>SoundWave)
Tests false negatives?     NO             NO            NO              YES
Accuracy                   94.5%          93%           93.9%           99%

5 Conclusion

In this paper, we proposed a gesture set for HCI based on the Doppler effect. The sound field is generated by a pair of speaker and microphone, and the signal reflected by the moving hand is captured by the microphone. We extract the five most robust features from the Doppler shift and classify a set of 11 gestures with a classifier based on multistage decision rules. Compared with the state of the art, the proposed features better mitigate potential false positives, and our method achieves high accuracy in classifying all gestures with no training. Finally, the experimental results illustrate that our gesture set performs very well on usability, including high accuracy, few false positives, learnability, memorability, and zero start-up cost.

Acknowledgment. We thank the participants for taking part in the user study. This work is partially supported by the National Key Research and Development Program of China (2016YFB0502201).


References

1. Ai, H., Men, Y., Han, L., Li, Z., Liu, M.: High precision gesture sensing via quantitative characterization of the Doppler effect. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 973–978. IEEE (2016)
2. Asadzadeh, P., Kulik, L., Tanin, E.: Gesture recognition using RFID technology. Pers. Ubiquit. Comput. 16(3), 225–234 (2012)
3. Aumi, M.T.I., Gupta, S., Goel, M., Larson, E., Patel, S.: DopLink: using the Doppler effect for multi-device interaction. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 583–586. ACM (2013)
4. Bevan, N., Curson, I.: Methods for measuring usability. In: Howard, S., Hammond, J., Lindgaard, G. (eds.) Human-Computer Interaction INTERACT 1997. ITIFIP, pp. 672–673. Springer, Boston, MA (1997). https://doi.org/10.1007/978-0-387-35175-9_126
5. Cabral, M.C., Morimoto, C.H., Zuffo, M.K.: On the usability of gesture interfaces in virtual reality environments. In: Proceedings of the 2005 Latin American Conference on Human-Computer Interaction, pp. 100–108. ACM (2005)
6. Chen, K.Y., Ashbrook, D., Goel, M., Lee, S.H., Patel, S.: AirLink: sharing files between multiple devices using in-air gestures. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 565–569. ACM (2014)
7. Fu, B., Karolus, J., Grosse-Puppendahl, T., Hermann, J., Kuijper, A.: Opportunities for activity recognition using ultrasound Doppler sensing on unmodified mobile phones. In: Proceedings of the 2nd International Workshop on Sensor-based Activity Recognition and Interaction, p. 8. ACM (2015)
8. Gupta, S., Morris, D., Patel, S., Tan, D.: SoundWave: using the Doppler effect to sense gestures. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1911–1914. ACM (2012)
9. Jeong, J., Jang, Y.: Max-min hand cropping method for robust hand region extraction in the image-based hand gesture recognition. Soft Comput. 19(4), 815–818 (2015)
10. Kellogg, B., Talla, V., Gollakota, S.: Bringing gesture recognition to all devices. In: NSDI 14, pp. 303–316 (2014)
11. Molchanov, P., Gupta, S., Kim, K., Kautz, J.: Hand gesture recognition with 3D convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–7 (2015)
12. Nielsen, M., Störring, M., Moeslund, T.B., Granum, E.: A procedure for developing intuitive and ergonomic gesture interfaces for HCI. In: Camurri, A., Volpe, G. (eds.) GW 2003. LNCS (LNAI), vol. 2915, pp. 409–420. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24598-8_38
13. Paramonov, P., Sutula, N.: Simplified scoring methods for HMM-based speech recognition. Soft Comput. 20(9), 3455–3460 (2016)
14. Pittman, C., Wisniewski, P., Brooks, C., LaViola Jr., J.J.: Multiwave: Doppler effect based gesture recognition in multiple dimensions. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 1729–1736. ACM (2016)
15. Pittman, C.R., LaViola Jr., J.J.: Multiwave: complex hand gesture recognition using the Doppler effect. In: Proceedings of the 43rd Graphics Interface Conference, pp. 97–106. Canadian Human-Computer Communications Society (2017)


16. Qifan, Y., Hao, T., Xuebing, Z., Yin, L., Sanfeng, Z.: Dolphin: ultrasonic-based gesture recognition on smartphone platform. In: 2014 IEEE 17th International Conference on Computational Science and Engineering (CSE), pp. 1461–1468. IEEE (2014)
17. Rautaray, S.S., Agrawal, A.: Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev. 43(1), 1–54 (2015)
18. Seddon, N., Bearpark, T.: Observation of the inverse Doppler effect. Science 302(5650), 1537–1540 (2003)
19. Suk, H.I., Sin, B.K., Lee, S.W.: Hand gesture recognition based on dynamic Bayesian network framework. Pattern Recogn. 43(9), 3059–3072 (2010)
20. Xiao, Q., Siqi, L.: Motion retrieval based on dynamic Bayesian network and canonical time warping. Soft Comput. 21(1), 267–280 (2017)
21. Xiao, Q., Song, R.: Motion retrieval based on motion semantic dictionary and HMM inference. Soft Comput. 21(1), 255–265 (2017)

An Approach of Collecting Performance Anomaly Dataset for NFV Infrastructure

Qingfeng Du, Yu He, Tiandi Xie, Kanglin Yin, and Juan Qiu

School of Software Engineering, Tongji University, Shanghai, China
Software Engineering R&D Centre, Tongji University, Jishi Building, Shanghai, China
https://github.com/XLab-Tongji

Abstract. Network Function Virtualization (NFV) technology is widely used in industry and academia. Meanwhile, it brings many challenges to the reliability of NFV applications, such as anomaly detection, anomaly localization, and anomaly prediction, all of which need a large amount of anomaly data. This paper designs a method for collecting anomaly data from Infrastructure as a Service (IaaS) and constructs an anomaly database for NFV applications. Three types of anomaly datasets are created for anomaly study: datasets of workload with performance data, fault-load with performance data, and violations of Service Level Agreements (SLA) with performance data. In order to better simulate anomalies in a production environment, we use Kubernetes to build a distributed environment, and a fault injection system is utilized to accelerate the occurrence of anomalies. Our aim is to provide more valuable anomaly data for reliability research in NFV environments.

Keywords: Anomaly database · NFV · Kubernetes · IaaS · Clearwater · Performance monitoring · Fault injection

1 Introduction

Network Function Virtualization (NFV) is becoming more and more popular, and many Communication Service Providers (CSPs) have begun to migrate applications to NFV environments [1]. Anomaly detection and anomaly localization are very important for providing better network services, and in some circumstances it is also necessary to predict anomalies. All of this requires analyzing the rules and connections in a large amount of anomaly data, but in a production environment the cost of collecting such data is high. It is therefore meaningful to collect anomaly data for research in an experimental environment.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 59–71, 2018.
https://doi.org/10.1007/978-3-030-05057-3_5


Q. Du et al.

At present, there are many anomaly datasets, such as the KDD CUP 99 dataset, the NAB dataset, and the Yahoo Webscope S5 dataset, all of which can serve as benchmarks for evaluating anomaly detection algorithms. However, these datasets have some restrictions, such as single labels and data redundancy. On this basis, we collect anomaly data from three different perspectives. In an NFV environment, a failure rarely has a single cause; to describe different exceptions more accurately, multiple types of fault tags are necessary. Our method uses a fault injection system to specify the fault type of the anomaly data, making the datasets more suitable for multi-class machine learning problems [2]. In addition, not only malfunctions of system resources but also user-induced workload pressure can lead to anomalous system behavior [3]; in a production environment, an increase in users may be a more important cause of anomalous service than hardware anomaly events, so our method also collects anomaly data under different workloads. Finally, in NFV applications the typical quality-of-service index is the Service Level Agreement (SLA); an SLA violation represents an anomalous service. Our method therefore also collects performance data at different SLA levels, helping researchers analyze the relationship between SLA violations and the IaaS performance data of a system. At last, we propose several supervised machine learning models to detect SLA violations of VNFs and anomalies in the IaaS, and compare their experimental results. The comparison shows that our anomaly database has reference value for anomaly detection in VNF environments.

The paper is organized as follows: Sect. 2 introduces the technical background and our related work in the construction of the anomaly database. Sect. 3 introduces the architecture of the data collection. Sect. 4 shows the implementation of our experiment. Sect. 5 provides a classical case study of the Clearwater project, giving a detailed description of the building of the anomaly database. Finally, we summarize the contribution and discuss future work in Sect. 6.

2 Background and Related Work

1 http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
2 https://github.com/numenta/NAB
3 https://webscope.sandbox.yahoo.com/catalog.php?datatype=s
4 https://en.wikipedia.org/wiki/Service-level_agreement
5 http://www.projectclearwater.org/

With the development of Internet applications and the maturity of hardware virtualization, the emergence of Infrastructure as a Service (IaaS) [4] provides the underlying hardware support for this architecture. Network providers no longer need to care about the details of the underlying hardware devices and can concentrate on providing upper-level services. In this context, Virtual Network Functions (VNFs) represent any virtual execution environment configured to provide a given network service. VNFs are often structured in several components, each hosted on a single VM.

The existing anomaly databases collect a lot of anomaly data in different fields. The KDD CUP 99 dataset is used for network attack diagnosis: each record states only whether or not an attack occurred at that moment, which means the dataset has a single label, normal or anomaly. Even NSL-KDD, the further optimization of KDD CUP 99 by Mahbod Tavallaee and his collaborators, has the same limitation [5]. This paper provides a disturbance system that specifies the type of fault load, allowing analysis of the influence of different fault types on the performance of the system under test. Markus Thill presents a comparative study in which several online anomaly detection algorithms are compared on the large Yahoo Webscope S5 anomaly benchmark [6]; however, the Yahoo Webscope S5 dataset is better suited to time series analysis and remains limited for the classification of different faults. We present a new approach to collecting performance data labeled with fault types, which has more advantages for the classification side of anomaly detection. In this paper, we integrate common single-fault time series analysis problems and multi-fault classification problems in complex systems, propose a corresponding performance data collection system and disturbance system, and establish varied datasets in our anomaly database, providing a reference for fault analysis in different scenes. The details are shown on our site.

3 Architecture of Data Collection

This section outlines the framework of our performance data collection. In order to accurately collect data with a fault-type label, the framework consists of three systems: the target application system (target system), the disturbance system, and the performance monitoring system (monitoring system), as shown in Fig. 1.

3.1 Target System

The target system is an NFV application system, i.e., a software implementation of network functions that can be deployed on a network functions virtualization infrastructure (NFVI). The NFVI is the totality of all hardware and software components that build the environment in which VNFs are deployed.

3.2 Disturbance System

The core function of the disturbance system is fault injection [7,8]; it is used to accelerate the occurrence of anomaly events in the target system, such as

6 https://github.com/XLab-Tongji


Fig. 1. Architecture of the performance data collection

hardware performance bottlenecks, SLA violations, and so on. In this paper, we use the Linux stress tool stress-ng [9] to simulate system pressure and implement the fault injection function. In order to produce different types of disturbance in the target system, we use different types of fault injection:

– CPU stress fault
– MEMORY stress fault
– IO stress fault

Every type of fault injection consumes the corresponding system resource as much as possible to ensure that anomaly events occur. In most situations, anomaly diagnosis of platforms or systems targets single point failures [10], so we use a strategy that ensures only one type of disturbance occurs on only one virtual machine at a time. While injecting a fault, the disturbance system records a log of the injection, including the start time, the duration, the fault type, and the target virtual machine. After the monitoring system collects performance data, these logs can be used to tag the performance data.

3.3 Monitoring System

There are many mature IaaS-layer monitoring schemes at present, such as Zabbix, Nagios, and Cacti. Considering our experimental environment and

7 https://www.zabbix.com/
8 https://www.nagios.org/
9 https://www.cacti.net/


monitoring project items, we use Zabbix to monitor the system and collect performance data online. Zabbix is enterprise open source monitoring software for networks and applications with a client/server model; a Zabbix agent is installed in each VM. Experience shows that agent-based monitoring is more accurate than agent-less monitoring and can describe the performance model of a system more precisely [11]. Table 1 shows the performance model used in our approach. Zabbix agents collect these metrics from the VMs and store them in the server's MySQL database. We also provide a Java application to download the performance data through the RESTful API of the Zabbix server.

Table 1. Zabbix monitoring metrics
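The download step can be sketched against Zabbix's JSON-RPC API. The paper's tool is a Java application; the Python sketch below is ours for illustration only. `user.login` and `history.get` are standard Zabbix API methods, but exact parameter names vary between Zabbix versions, and the server URL, credentials, and item IDs here are placeholders.

```python
import json
import urllib.request

def build_payload(method, params, auth=None, req_id=1):
    """Assemble one Zabbix JSON-RPC request body."""
    payload = {"jsonrpc": "2.0", "method": method,
               "params": params, "id": req_id}
    if auth is not None:
        payload["auth"] = auth  # session token returned by user.login
    return payload

def zabbix_call(url, method, params, auth=None):
    """POST one request to the server's api_jsonrpc.php endpoint."""
    body = json.dumps(build_payload(method, params, auth)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json-rpc"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

def fetch_history(url, user, password, itemids, time_from, time_till):
    """Log in, then pull raw history rows for the given monitored items."""
    token = zabbix_call(url, "user.login",
                        {"user": user, "password": password})
    return zabbix_call(url, "history.get",
                       {"itemids": itemids, "time_from": time_from,
                        "time_till": time_till, "output": "extend",
                        "sortfield": "clock"}, auth=token)
```

The returned rows (item id, clock, value) can then be joined with the fault injection logs to build the labelled datasets described below.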

4 Implementation

This section presents the implementation of our test bed environment. It includes the infrastructure, the Kubernetes platform, the monitoring system, the attacker system, and the clearwater-docker NFV application running on the Kubernetes platform, as shown in Fig. 2.

4.1 Infrastructure

The virtualized platform is a VMware ESXi machine with 64 CPUs, 128 GB memory, and 2 TB disk, able to provide multiple virtual machines on one physical machine. In this paper, we create 10 VMs on it; every VM has 2 CPUs, 8 GB memory, and 20 GB disk. The VMs are connected through a 1000 Mbps virtualized network. Each VM has a Docker environment (version 17.03.2-ce) that can run most Docker containers.


4.2 Kubernetes

Kubernetes is a powerful container management platform. We use it to deploy the Clearwater project as described below. Here we use the Rancher scheme to deploy the Kubernetes platform on the VMs, because it makes the deployment easy. The installation steps are as follows:

1. Confirm that the network between the newly created virtual machines is working;
2. Select one host as the Rancher server host and deploy the latest Rancher Docker image on it;
3. Wait until the Rancher server is running correctly, then access the Rancher server page on port 80 of the host;
4. Create a new environment for the test bed based on the Kubernetes template;
5. Add all other VMs to this environment and wait for the Rancher server to add them to the Kubernetes platform automatically.

Fig. 2. Deployment of the test bed

4.3 Monitoring and Attack System

The monitoring system consists of the Zabbix server host and the Zabbix agents. A Zabbix agent was installed on each VM when it was created and connected to the Zabbix server through the web page configuration. Once the connection is set up, the agent begins to collect performance data and report it to the server at a set time interval. The attacker host is an independent host; it executes the attack scripts we provide to perform fault injection into the VMs.

10 https://rancher.com/

4.4 NFV Application

The NFV application is a distributed computing system running an NFV workload. Here we utilise the Clearwater project, an open source implementation of an IMS for cloud platforms. It provides SIP-based (Session Initiation Protocol) voice and video calling and messaging applications, and implements key standardized interfaces and functions of an IMS (except a core network), which enables industries to easily deploy, integrate, and scale an IMS [3]. Clearwater is consequently well suited for NFV-related studies. It consists of about 10 components; every component plays its own unique role in the system, and the relationship between the components is shown in Fig. 3. Due to the Docker deployment scheme, every Clearwater Docker container is configured to allow unlimited use of host resources.

Fig. 3. Architecture of the clearwater project

Bono (Edge Proxy): The Bono nodes form a horizontally scalable SIP edge proxy providing both a SIP IMS Gm compliant interface and a WebRTC interface to clients. Client connections are load balanced across the nodes. The Bono node provides the anchor point for the client's connection to the Clearwater system, including support for various NAT traversal mechanisms. A client is therefore anchored to a particular Bono node for the duration of its registration, but can move to another Bono node if the connection or client fails.

Sprout (SIP Router): The Sprout nodes act as a horizontally scalable, combined SIP registrar and authoritative routing proxy, and handle client


authentication and the ISC interface to application servers. The Sprout nodes also contain the built-in MMTEL application server.

Dime (Diameter Gateway): Dime nodes run Clearwater's Homestead and Ralf components. Homestead (HSS Cache) provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or pull the data from an IMS compliant HSS over the Cx interface. Ralf provides an HTTP API that both Bono and Sprout can use to report billable events that should be passed to the CDF (Charging Data Function) over the Rf billing interface.

Vellum (State Store): Vellum is used to maintain all long-lived state in the deployment. It does this by running a number of cloud-optimized, distributed storage clusters including Cassandra, etcd, Chronos, and Memcached.

Homer (XDMS): Homer is a standard XDMS used to store MMTEL service settings documents for each user of the system.

Ellis: Ellis is a sample provisioning portal providing self sign-up, password management, line management, and control of MMTEL service settings.

As introduced before, Bono, Sprout, and Homestead are the core modules of the Clearwater project; they work together to control sessions initiated by users, so our data collection mainly focuses on these three modules. When the experiment begins, Clearwater runs normally to generate normal data, or runs overloaded to generate anomaly data. While the system is running normally, the attacker host can execute attacks to disturb the system and produce anomaly data, recording the corresponding log. Meanwhile, the monitoring system monitors the VMs' performance metrics and collects all normal and anomaly data to establish the database.

5 Case Study

This section introduces a classic Clearwater case study. On the basis of normal system operation, we disturb the system with overload stress and fault injection respectively to produce the anomaly dataset, and select machine learning algorithms with good performance in anomaly detection [12–15] to verify the usability of the datasets. In order to produce a normal workload, we use the officially recommended tool clearwater-sip-stress-coreonly. It controls the working stress of the system through three parameters:

– subscriber count: the number of subscribers to emulate;
– duration: the number of minutes to run stress for;
– multiplier: optional, a multiplier for the VoLTE load profile (the default of 1 means 1.3 calls and 24 re-registers per subscriber per hour; passing 2 means 2.6 calls and 48 re-registers per subscriber per hour).

11 https://clearwater.readthedocs.io/en/stable/Clearwater_stress_testing.html


We chose 500 subscribers, 60 min, and a multiplier of 450 for the experiment. At this point the system reaches a 100% successful call rate; when the work stress continues to increase, the successful call rate begins to decline. We therefore mark this point as the engineering level point x, meaning the system is running at full workload under the current configuration.

5.1 Workload Module

As described above, we use the engineering level point x as the standard workload unit and test the performance data of the system under 0.8x, 1x, 1.5x, 2x, and 2.5x pressure respectively. The structure of the collected dataset is shown in Table 2.

5.2 Faultload Module

In this paper, we focus on single point faults: at any moment, only one type of fault is injected into one VM. The 0.8x engineering level is chosen as the normal running workload so that the anomalous behavior generated by fault injection can be easily observed. The process of fault injection is shown in Fig. 4.

Fig. 4. Fault injection process

Within a specified time period, the fault injecting program selects a random fault type, a random target virtual machine, and a random injection duration to start a disturbance process. This process continues until the total time consumed by fault injection reaches the stipulated time period, as described in Algorithm 1. The disturbance system also records an injection log while injecting each fault; the key information includes timestamp, fault type, target host, and injection duration. As described in Algorithm 2, we use the fault injection log to indicate which fault injection stage each performance data record belongs to: normal, cpu fault, memory fault, or io fault. The result of data processing is shown in Table 3.

68

Q. Du et al.

Algorithm 1. Fault Inject Controller
Input: vm_list, inject_type_list, duration_list, duration
1: timer = 0
2: while timer < duration do
3:   inject_vm = random(vm_list)
4:   inject_type = random(inject_type_list)
5:   inject_duration = random(duration_list)
6:   timer += inject_duration
7:   inject(inject_vm, inject_type, inject_duration)
8:   sleep(pause)
9: end while
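A runnable Python sketch of this controller follows. The inject callable is a stand-in for the real disturbance call (e.g. a stress-ng invocation on the target VM); the function name, the pause parameter, and the returned log format are our assumptions:

```python
import random
import time

def fault_inject_controller(vm_list, inject_type_list, duration_list,
                            duration, pause=1.0, inject=print):
    """Randomly inject single-point faults until the total injected time
    reaches the stipulated duration (Algorithm 1).

    `inject(vm, fault_type, seconds)` is a placeholder for the real fault
    injector; by default it just prints the chosen triple. Returns the
    list of injected (vm, fault_type, duration) triples as a log.
    """
    injections = []
    timer = 0
    while timer < duration:
        inject_vm = random.choice(vm_list)
        inject_type = random.choice(inject_type_list)
        inject_duration = random.choice(duration_list)
        timer += inject_duration
        inject(inject_vm, inject_type, inject_duration)
        injections.append((inject_vm, inject_type, inject_duration))
        time.sleep(pause)
    return injections

# Dry run: three VMs, three fault types, no actual injection.
log = fault_inject_controller(["vm1", "vm2", "vm3"],
                              ["cpu", "memory", "io"],
                              [30, 60, 120], duration=300, pause=0.0,
                              inject=lambda *args: None)
```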

To collect the anomalous SLA data, the workload module and the faultload module work together to disturb the system. We calculate the SLA level of the system from the percentage of successful requests (PSR). When PSR ≥ 90%, the system is in good condition, marked as level 2; when 50% ≤ PSR < 90%, the system is in an unhealthy condition, marked as level 1; when PSR < 50%, the system is in bad condition, marked as level 0. The structure of the dataset is shown in Table 4.

Table 2. Dataset A

Timestamp  | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | Workload level
1521448560 | 70%        | 73%        | ... | 69%        | 77%        | ... | 66%        | 69%        | ... | 1
1521448565 | 73%        | 73%        | ... | 68%        | 75%        | ... | 70%        | 74%        | ... | 1
...
1521458230 | 98%        | 99%        | ... | 97%        | 100%       | ... | 95%        | 97%        | ... | 2
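The PSR thresholds described above map directly onto a small labelling helper (a sketch; the handling of the exact 90% and 50% boundaries follows our reading of the thresholds):

```python
def sla_level(psr: float) -> int:
    """Map the percentage of successful requests (PSR, in [0, 100])
    to an SLA level: 2 = good, 1 = unhealthy, 0 = bad."""
    if psr >= 90.0:
        return 2
    if psr >= 50.0:
        return 1
    return 0
```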

5.3 Dataset Verification

This part introduces four widely used machine learning algorithms, namely support vector machines, nearest neighbours, naive Bayes, and random forests, and uses them to locate outliers in the system performance data.

Algorithm 2. Data Labeled Controller
Input: performance_data, injection_log
1: labeled_data = []
2: while performance_data.has_next() do
3:   data = performance_data.next()
4:   data_label = label(data, injection_log)
5:   labeled_data.append(data_label)
6: end while
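A Python sketch of this labelling step, matching each performance record's timestamp against the injection-log intervals; the record and log layouts here are our assumptions, not the paper's exact schema:

```python
def label(record, injection_log):
    """Return the record extended with a fault label: the fault type of the
    log entry whose injection interval covers the record's timestamp, or
    'normal' if no interval matches."""
    ts = record["timestamp"]
    for start, fault_type, _target, duration in injection_log:
        if start <= ts < start + duration:
            return {**record, "label": fault_type}
    return {**record, "label": "normal"}

def label_dataset(performance_data, injection_log):
    """Algorithm 2: label every performance record using the log."""
    return [label(rec, injection_log) for rec in performance_data]

# Toy data: a cpu fault at t=100 for 60 s, an io fault at t=300 for 30 s.
injection_log = [(100, "cpu", "vm1", 60), (300, "io", "vm2", 30)]
records = [{"timestamp": 90}, {"timestamp": 120}, {"timestamp": 310}]
labeled = label_dataset(records, injection_log)
# labels: normal, cpu, io
```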


Table 3. Dataset B

Timestamp | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | Normal | CPU | MEMORY | IO
152263940 | 70%        | 73%        | ... | 69%        | 77%        | ... | 66%        | 69%        | ... | 1      | 0   | 0      | 0
152263945 | 73%        | 73%        | ... | 68%        | 75%        | ... | 70%        | 74%        | ... | 1      | 0   | 0      | 0
152263950 | 73%        | 100%       | ... | 69%        | 79%        | ... | 72%        | 73%        | ... | 0      | 1   | 0      | 0
...
152267680 | 71%        | 74%        | ... | 70%        | 75%        | ... | 99%        | 72%        | ... | 0      | 0   | 0      | 1

Table 4. Dataset C

Timestamp  | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | SLA level
1521448560 | 90%        | 72%        | ... | 92%        | 74%        | ... | 85%        | 91%        | ... | 2
1521448565 | 85%        | 77%        | ... | 83%        | 75%        | ... | 73%        | 88%        | ... | 1
...
1521458230 | 66%        | 68%        | ... | 92%        | 89%        | ... | 87%        | 79%        | ... | 0

Table 5. Validation results on the anomaly datasets

Dataset   | Measure   | Nearest neighbours | SVM  | Naive Bayes | Random forest
Dataset A | Precision | 0.98               | 0.89 | 0.95        | 0.97
          | Recall    | 0.97               | 0.88 | 0.93        | 0.96
          | F1-score  | 0.97               | 0.87 | 0.93        | 0.98
Dataset B | Precision | 0.93               | 0.90 | 0.96        | 0.99
          | Recall    | 0.92               | 0.91 | 0.95        | 0.98
          | F1-score  | 0.93               | 0.89 | 0.97        | 0.99
Dataset C | Precision | 0.94               | 0.87 | 0.89        | 0.98
          | Recall    | 0.97               | 0.93 | 0.91        | 0.96
          | F1-score  | 0.96               | 0.92 | 0.94        | 0.97

There are 737 records in dataset A and dataset B. We employed the first 80% of them as the training set; after training the learning methods, the remaining 20% were used as the test set to validate each model. The validation results are shown in Table 5. The results show that the precision, recall, and F1-score of each model reach high values, and because of the multi-class nature of the dataset, the random forest model achieves the best results.
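The 80/20 chronological split and the reported measures can be sketched as follows, in pure Python (the toy labels are ours; macro-averaging the per-class scores reproduces multi-class precision/recall/F1):

```python
def chronological_split(records, labels, train_frac=0.8):
    """Split time-ordered data: first 80% for training, rest for test."""
    cut = int(len(records) * train_frac)
    return records[:cut], labels[:cut], records[cut:], labels[cut:]

def precision_recall_f1(y_true, y_pred, positive):
    """Binary precision, recall, and F1 for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy evaluation of one class on four test records.
y_true = ["cpu", "normal", "cpu", "io"]
y_pred = ["cpu", "cpu", "cpu", "io"]
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="cpu")
```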

6 Conclusion and Future Work

In this paper, we describe an approach to deploying the NFV application Clearwater through the Kubernetes platform. On this basis, we use a disturbance system and a monitoring system to collect performance data of IaaS


layer devices under the NFV application scenario and build an anomaly database. Three categories of anomaly datasets with specified labels are collected: workload with performance data, faultload with performance data, and SLA level with performance data. The details of the anomaly database can be accessed on our website12. Using several widely used machine learning algorithms, we verify these datasets and obtain high accuracy, which means the datasets have reference value for anomaly detection. In the future, we will try more anomaly scenes and causes of anomalies, and build corresponding anomaly datasets to analyze them. We hope this work offers guidance for the detection of anomalies in different scenes.

References

1. Liu, J., Jiang, Z., Kato, N., Akashi, O., Takahara, A.: Reliability evaluation for NFV deployment of future mobile broadband networks. IEEE Wirel. Commun. 23(3), 90–96 (2016)
2. Pieters, M., Wiering, M.: Comparison of machine learning techniques for multi-label genre classification. In: Verheij, B., Wiering, M. (eds.) BNAIC 2017. CCIS, vol. 823, pp. 131–144. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76892-2_10
3. Sauvanaud, C., Lazri, K., Kaâniche, M., Kanoun, K.: Anomaly detection and root cause localization in virtual network functions. In: 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 196–206. IEEE (2016)
4. Bhardwaj, S., Jain, L., Jain, S.: Cloud computing: a study of infrastructure as a service (IAAS). Int. J. Eng. Inf. Technol. 2(1), 60–63 (2010)
5. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, pp. 1–6. IEEE (2009)
6. Thill, M., Konen, W., Bäck, T.: Online anomaly detection on the Webscope S5 dataset: a comparative study. In: 2017 Evolving and Adaptive Intelligent Systems (EAIS), pp. 1–8, May 2017
7. Natella, R., Cotroneo, D., Madeira, H.S.: Assessing dependability with software fault injection: a survey. ACM Comput. Surv. (CSUR) 48(3), 44 (2016)
8. Delvaux, J., Verbauwhede, I.: Fault injection modeling attacks on 65 nm arbiter and RO sum PUFs via environmental changes. IEEE Trans. Circuits Syst. I: Regul. Pap. 61(6), 1701–1713 (2014)
9. King, C.: Stress-ng (2018)
10. Wang, Y., Li, X.: Achieve high availability about point-single failures in OpenStack. In: 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 01, pp. 45–48, December 2015
11. Aversa, R., Panza, N., Tasquier, L.: An agent-based platform for cloud applications performance monitoring. In: 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, pp. 535–540, July 2015

12 https://github.com/XLab-Tongji/ADNFVI.


12. Buczak, A.L., Guven, E.: A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 18(2), 1153–1176 (2016)
13. Iglesias, F., Zseby, T.: Analysis of network traffic features for anomaly detection. Mach. Learn. 101(1–3), 59–84 (2015)
14. Kulkarni, A., Pino, Y., French, M., Mohsenin, T.: Real-time anomaly detection framework for many-core router through machine-learning techniques. ACM J. Emerg. Technol. Comput. Syst. (JETC) 13(1), 10 (2016)
15. Erfani, S.M., Rajasegarar, S., Karunasekera, S., Leckie, C.: High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recogn. 58, 121–134 (2016)

An Axiomatization for BSP Algorithms

Yoann Marquer and Frédéric Gava(B)

Laboratory of Algorithms, Complexity and Logic (LACL), University of Paris-East, Créteil, France
[email protected], [email protected]

Abstract. Gurevich's thesis stipulates that sequential abstract state machines (asms) capture the essence of sequential algorithms. On the other hand, the bulk-synchronous parallel (bsp) bridging model is a well-known model for hpc algorithm design. It provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. The assumptions of the bsp model thus provide portable and scalable performance predictions on most hpc systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture bsp algorithms.

Keywords: bsp · asm · Parallel algorithm · hpc · Postulates · Cost model

1 Introduction

1.1 Context of the Work

Nowadays, hpc (high performance computing) is the norm in many areas, but it remains more difficult to have well-defined paradigms and a common vocabulary as is the case in the traditional sequential world. The problem arises from the difficulty of getting a taxonomy of computer architectures and frameworks: there is a zoo of definitions of systems, languages, paradigms and programming models. Indeed, in the hpc community, several terms could be used to designate the same thing, so that misunderstandings are easy. We can cite parallel patterns [5] versus algorithmic skeletons [8]; shared memory (pram) versus thread concurrency and direct remote access (drma); asynchronous send/receive routines (mpi, http://mpi-forum.org/) versus communicating processes (π-calculus). In the sequential world, it is easier to classify programming languages within their paradigm (functional, object oriented, etc.) or by using some properties of the compilers (statically or dynamically typed, abstract machine or native code execution). This is mainly due to the fact that there is an overall consensus on what sequential computing is. For these languages, formal semantics have often been studied, and there are now many tools for testing, debugging, cost analysis, software engineering, etc. In this way, programmers can implement sequential algorithms using these languages, which characterize properly the sequential algorithms.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 72–88, 2018. https://doi.org/10.1007/978-3-030-05057-3_6

An Axiomatization for BSP Algorithms

73

This consensus is only fair because everyone informally agrees on what constitutes a sequential algorithm. And now, half a century later, there is a growing interest in formally defining the notion of algorithm [10]. Gurevich introduced an axiomatic presentation (largely machine independent) of sequential algorithms in [10]. The main idea is that there is no language that truly represents all sequential algorithms. In fact, every algorithmic book presents algorithms in its own way, and programming languages give too much detail. The axiomatic definition [10] of algorithms has been mapped to the notion of abstract state machine (asm, a kind of Turing machine with the appropriate level of abstraction): every sequential algorithm can be captured by an asm. This allows a common vocabulary about sequential algorithms, which has been studied by the asm community for several years. A parallel computer, or multi-processor system, is a computer composed of more than one processor (or unit of computation). It is common to classify parallel computers (Flynn's taxonomy) by the way they access the system memory (shared or distributed). Indeed, the memory access scheme heavily influences the programming method of a given system. Distributed memory systems are needed for computations using a large amount of data which does not fit in the memory of a single machine. The three postulates for sequential algorithms are mainly consensual. Nevertheless, to our knowledge, there is no such work for hpc frameworks: first, due to the zoo of (informal) definitions, and second, due to a lack of realistic cost models of common hpc architectures. In hpc, cost measurement is based not on the complexity of an algorithm but rather on the execution time, measured using empirical benchmarks. Programmers benchmark load balancing, communication (size of data), etc.
Using such techniques, it is very difficult to explain why one code is faster than another and which one is more suitable for one architecture or another. This is regrettable because the community is failing to obtain a rigorous characterization of sub-classes of hpc algorithms. There is also a lack of study of the algorithmic completeness of hpc languages. This is the basis from which to specify what can or cannot be effectively programmed. Finally, taking into account all the features of all hpc paradigms is a daunting task that is unlikely to be achieved [9]. Instead, a bottom-up strategy (from the simplest models to the most complex) may be a solution that could serve as a basis for more general hpc models.

1.2 Content of the Work

Using a bridging model [20] is a first step towards this solution, because it simplifies the tasks of algorithm design and programming, simplifies reasoning about cost, and ensures better portability from one system to another. A bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. We conscientiously limit our work to the bulk-synchronous parallel (bsp) bridging model [1,18] because it has the advantage of being endowed with a simple model of execution. We leave more complex models


to future work. Moreover, there are many different libraries and languages for programming bsp algorithms, for example, the bsplib for c [11] or java [17], bsml [?], pregel [12] for big data, etc. Concurrent asms [3] try to capture the more general notion of asynchronous and distributed computations. We promote a rather different "bottom-up" approach, consisting of restricting the model under consideration so as to better highlight the algorithm execution time (which is often too difficult to assess for general models), and more generally to formalize the algorithms of a bridging model at their natural level of abstraction, instead of using a more general model and then restricting it with arbitrary hypotheses. As a basis for this work, we first give an axiomatic definition of bsp algorithms (algoBSP) with only four postulates. Then we extend the asm model [10] of computation (asmBSP) for bsp. Our goal is to define a convincing set of parallel algorithms running in predictable time, and to construct a model that computes these algorithms only. This can be summarized by algoBSP = asmBSP. An interesting and novel point of this work is that the bsp cost model is preserved.

1.3 Outline

Many definitions used here are well known to the asm community. Recalling all of them would be too long, but they are available in the online technical report [22]. The remainder of this paper is structured as follows. In Sect. 2 we first recall the bsp model and define its postulates. In Sect. 3 we give the operational semantics of asmBSP and then the main result. Section 4 concludes, discusses related work, and gives a brief outlook on future work.

2 Characterizing BSP Algorithms

2.1 The BSP Bridging Model of Computation

As the ram model provides a unifying approach that can bridge the worlds of sequential hardware and software, so Valiant sought [20] a unifying model that could provide an effective (and universal) bridge between parallel hardware and software. A bridging model [20] reduces the gap between an abstract execution (programming an algorithm) and concrete parallel systems (using a compiler and designing/optimizing a physical architecture). The direct-mode bsp model [1,18] is a bridging model that simplifies the programming of various parallel architectures using a certain level of abstraction. The assumptions of the bsp model are to provide portable and scalable performance predictions on hpc systems. Without dealing with the low-level details of hpc architectures, the programmer can thus focus on algorithm design only. The bsp bridging model describes a parallel architecture, an execution model for the algorithms, and a cost model which allows their performance on a given bsp architecture to be predicted.


A bsp computer can be specified by p uniform computing units (processors), each capable of performing one elementary operation or accessing a local memory in one time unit. Processors communicate by sending a data to every other processor in g time units (the gap, which reflects network bandwidth inefficiency), and a barrier mechanism is able to synchronise all the processors in L time units (the "latency", reflecting the ability of the network to deliver messages under a continuous load). Such values, along with the processor's speed (e.g. Mflops), can be determined empirically by executing benchmarks. The time g is thus the time for collectively delivering a 1-relation, which is a collective exchange where every processor receives/sends at most one word. The network can deliver an h-relation in time g×h. A bsp computation is organized as a sequence of supersteps (see Fig. 1). During a superstep, the processors may perform computations on local data or send messages to other processors. Messages are available for processing at their destinations by the next superstep, and each superstep is ended with the barrier synchronisation of the processors.

Fig. 1. A bsp super-step: processors p0...p3 perform local computations, then communication, then a barrier before the next superstep.

The execution time (cost) of a superstep s is the sum of the maximum of the local processing times, the data delivery time, and the global synchronisation time. It is expressed by the following formula: Cost(s) = w_s + h_s × g + L, where w_s = max_{0≤i<p} w_i^s is the maximum local computation time on any processor during superstep s, and h_s = max_{0≤i<p} h_i^s is the maximum number of words sent or received by any processor during s.
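The cost formula can be written down directly; the following sketch takes, for one superstep, the per-processor computation times w[i] and communicated word counts h[i] (function names are ours):

```python
def superstep_cost(w, h, g, L):
    """BSP cost of one superstep: Cost(s) = max_i w_i + g * max_i h_i + L,
    where w[i] is processor i's local computation time and h[i] the number
    of words it sends or receives (the h-relation)."""
    return max(w) + g * max(h) + L

def program_cost(supersteps, g, L):
    """Total cost of a bsp program: the sum of its superstep costs.
    `supersteps` is a list of (w, h) pairs, one per superstep."""
    return sum(superstep_cost(w, h, g, L) for w, h in supersteps)
```

For example, with p = 3, g = 3, and L = 50, a superstep with w = [10, 20, 15] and h = [2, 4, 1] costs 20 + 3×4 + 50 = 82 time units.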


Editors
Jaideep Vaidya, Rutgers University, Newark, NJ, USA
Jin Li, Guangzhou University, Guangzhou, China

ISSN 0302-9743  ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-05056-6  ISBN 978-3-030-05057-3 (eBook)
https://doi.org/10.1007/978-3-030-05057-3
Library of Congress Control Number: 2018962485
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues
© Springer Nature Switzerland AG 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Welcome to the proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2018), which was organized by Guangzhou University and held in Guangzhou, China, during November 15–17, 2018. ICA3PP 2018 was the 18th event in a series of conferences devoted to research on algorithms and architectures for parallel processing. Previous iterations of the conference include ICA3PP 2017 (Helsinki, Finland, November 2017), ICA3PP 2016 (Granada, Spain, December 2016), ICA3PP 2015 (Zhangjiajie, China, November 2015), ICA3PP 2014 (Dalian, China, August 2014), ICA3PP 2013 (Vietri sul Mare, Italy, December 2013), ICA3PP 2012 (Fukuoka, Japan, September 2012), ICA3PP 2011 (Melbourne, Australia, October 2011), ICA3PP 2010 (Busan, Korea, May 2010), ICA3PP 2009 (Taipei, Taiwan, June 2009), ICA3PP 2008 (Cyprus, June 2008), ICA3PP 2007 (Hangzhou, China, June 2007), ICA3PP 2005 (Melbourne, Australia, October 2005), ICA3PP 2002 (Beijing, China, October 2002), ICA3PP 2000 (Hong Kong, China, December 2000), ICA3PP 1997 (Melbourne, Australia, December 1997), ICA3PP 1996 (Singapore, June 1996), and ICA3PP 1995 (Brisbane, Australia, April 1995). ICA3PP is now recognized as the main regular event in the area of parallel algorithms and architectures, which covers many dimensions including fundamental theoretical approaches, practical experimental projects, and commercial and industry applications. This conference provides a forum for academics and practitioners from countries and regions around the world to exchange ideas for improving the efﬁciency, performance, reliability, security, and interoperability of computing systems and applications. ICA3PP 2018 attracted over 400 high-quality research papers highlighting the foundational work that strives to push beyond the limits of existing technologies, including experimental efforts, innovative systems, and investigations that identify weaknesses in existing parallel processing technology. 
Each submission was reviewed by at least two experts in the relevant areas, on the basis of their significance, novelty, technical quality, presentation, and practical impact. According to the review results, 141 full papers were selected to be presented at the conference, giving an acceptance rate of 35%. We also accepted 50 short papers and 24 workshop papers. In addition to the paper presentations, the program of the conference included four keynote speeches and two invited talks from esteemed scholars in the area, namely: Prof. Xuemin (Sherman) Shen, University of Waterloo, Canada; Prof. Wenjing Lou, Virginia Tech, USA; Prof. Witold Pedrycz, University of Alberta, Canada; Prof. Xiaohua Jia, City University of Hong Kong, Hong Kong; Prof. Xiaofeng Chen, Xidian University, China; Prof. Xinyi Huang, Fujian Normal University, China. We were extremely honored to have them as the conference keynote speakers and invited speakers. ICA3PP 2018 was made possible by the behind-the-scenes effort of selfless individuals and organizations who volunteered their time and energy to ensure the success


of this conference. We would like to express our special appreciation to Prof. Yang Xiang, Prof. Weijia Jia, Prof. Yi Pan, Prof. Laurence T. Yang, and Prof. Wanlei Zhou, the Steering Committee members, for giving us the opportunity to host this prestigious conference and for their guidance with the conference organization. We would like to emphasize our gratitude to the general chairs, Prof. Albert Zomaya and Prof. Minyi Guo, for their outstanding support in organizing the event. Thanks also to the publicity chairs, Prof. Zheli Liu and Dr Weizhi Meng, for the great job in publicizing this event. We would like to give our thanks to all the members of the Organizing Committee and Program Committee for their efforts and support. The ICA3PP 2018 program included two workshops, namely, the ICA3PP 2018 Workshop on Intelligent Algorithms for Large-Scale Complex Optimization Problems and the ICA3PP 2018 Workshop on Security and Privacy in Data Processing. We would like to express our sincere appreciation to the workshop chairs: Prof. Ting Hu, Prof. Feng Wang, Prof. Hongwei Li and Prof. Qian Wang. Last but not least, we would like to thank all the contributing authors and all conference attendees, as well as the great team at Springer that assisted in producing the conference proceedings, and the developers and maintainers of EasyChair. November 2018

Jaideep Vaidya Jin Li

Organization

General Chairs Albert Zomaya Minyi Guo

University of Sydney, Australia Shanghai Jiao Tong University, China

Program Chairs Jaideep Vaidya Jin Li

Rutgers University, USA Guangzhou University, China

Publication Chair Yu Wang

Guangzhou University, China

Publicity Chairs Zheli Liu Weizhi Meng

Nankai University, China Technical University of Denmark, Denmark

Steering Committee Yang Xiang (Chair) Weijia Jia Yi Pan Laurence T. Yang Wanlei Zhou

Swinburne University of Technology, Australia Shanghai Jiaotong University, China Georgia State University, USA St. Francis Xavier University, Canada Deakin University, Australia

Program Committee Pedro Alonso Daniel Andresen Cosimo Anglano Danilo Ardagna Kapil Arya Marcos Assuncao Joonsang Baek Anirban Basu Ladjel Bellatreche Jorge Bernal Bernabe Thomas Boenisch

Universitat Politècnica de València, Spain Kansas State University, USA Universitá del Piemonte Orientale, Italy Politecnico di Milano, Italy Northeastern University, USA Inria, France University of Wollongong, Australia KDDI Research Inc., Japan LIAS/ENSMA, France University of Murcia, Spain High-Performance Computing Center Stuttgart, Germany


George Bosilca Massimo Cafaro Philip Carns Alexandra Carpen-Amarie Aparicio Carranza Aniello Castiglione Arcangelo Castiglione Pedro Castillo Tzung-Shi Chen Kim-Kwang Raymond Choo Mauro Conti Jose Alfredo Ferreira Costa Raphaël Couturier Miguel Cárdenas Montes Masoud Daneshtalab Casimer Decusatis Eugen Dedu Juan-Carlos Díaz-Martín Matthieu Dorier Avgoustinos Filippoupolitis Ugo Fiore Franco Frattolillo Marc Frincu Jorge G. Barbosa Chongzhi Gao Jose Daniel García Luis Javier García Villalba Paolo Gasti Vladimir Getov Olivier Gluck Jing Gong Amina Guermouche Jeff Hammond Feng Hao Houcine Hassan Sun-Yuan Hsieh Chengyu Hu Xinyi Huang Mauro Iacono Shadi Ibrahim Yasuaki Ito Mathias Jacquelin Nan Jiang Lu Jiaxin

University of Tennessee, USA University of Salento, Italy Argonne National Laboratory, USA Vienna University of Technology, Austria City University of New York, USA University of Salerno, Italy University of Salerno, Italy University of Granada, Spain National University of Tainan, Taiwan The University of Texas at San Antonio, USA University of Padua, Italy Federal University, UFRN, Brazil University Bourgogne Franche-Comté, France CIEMAT, Spain Mälardalen University and Royal Institute of Technology, Sweden Marist College, USA University of Bourgogne Franche-Comté, France University of Extremadura, Spain Argonne National Laboratory, USA University of Greenwich, UK Federico II University, Italy University of Sannio, Italy West University of Timisoara, Romania University of Porto, Portugal Guangzhou University, China University Carlos III of Madrid, Spain Universidad Complutense de Madrid, Spain New York Institute of Technology, USA University of Westminster, UK Université de Lyon, France KTH Royal Institute of Technology, Sweden Telecom Sud-Paris, France Intel, USA Newcastle University, UK Universitat Politècnica de València, Spain National Cheng Kung University, Taiwan Shandong University, China Fujian Normal University, China University of Campania Luigi Vanvitelli, Italy Inria, France Hiroshima University, Japan Lawrence Berkeley National Laboratory, USA East China Jiaotong University, China Jiangxi Normal University, China


Edward Jung Georgios Kambourakis Gabor Kecskemeti Muhammad Khurram Khan Dieter Kranzlmüller Michael Kuhn Julian Kunkel Algirdas Lančinskas Patrick P. C. Lee Laurent Lefevre Hui Li Kenli Li Dan Liao Jingyu Liu Joseph Liu Yunan Liu Zheli Liu Jay Lofstead Paul Lu Amit Majumdar Tomas Margalef Stefano Markidis Alejandro Masrur Susumu Matsumae Raffaele Montella Francesco Moscato Bogdan Nicolae Francesco Palmieri Swann Perarnau Dana Petcu Salvador Petit Riccardo Petrolo Florin Pop Radu Prodan Zhang Qikun Thomas Rauber Khaled Riad Suzanne Rivoire Ivan Rodero Romain Rouvoy Antonio Ruiz-Martínez Françoise Sailhan Sherif Sakr Giandomenico Spezzano


Kennesaw State University, USA University of the Aegean, Greece Liverpool John Moores University, UK King Saud University, Saudi Arabia Ludwig Maximilian University of Munich, Germany University of Hamburg, Germany German Climate Computing Center, Germany Vilnius University, Lithuania The Chinese University of Hong Kong, SAR China Inria, France University of Electronic Science and Technology of China, China Hunan University, China University of Electronic Science and Technology of China, China Hebei University of Technology, China Monash University, Australia Jiangxi Normal University, China Nankai University, China Sandia National Laboratories, USA University of Alberta, Canada University of California San Diego, USA Universitat Autonoma de Barcelona, Spain KTH Royal Institute of Technology, Sweden Chemnitz University of Technology, Germany Saga University, Japan University of Naples Parthenope, Italy University of Campania Luigi Vanvitelli, Italy Argonne National Laboratory, Germany University of Salerno, Italy, Italy Argonne National Laboratory, USA West University of Timisoara, Romania Universitat Politècnica de València, Spain Rice University, USA University Politehnica of Bucharest, Romania University of Klagenfurt, Austria Beijing Institute of Technology, China University Bayreuth, Germany Zagazig University, Egypt Sonoma State University, USA Rutgers University, USA University of Lille, France University of Murcia, Spain CNAM, France The University of New South Wales, Australia ICAR-CNR and University of Calabria, Italy


Patricia Stolf John Stone Peter Strazdins Hari Subramoni Gang Sun Zhizhuo Sun Frederic Suter Yu-An Tan Ming Tao Andrei Tchernykh Massimo Torquati Tomoaki Tsumura Didem Unat Vladimir Voevodin Feng Wang Hao Wang Yu Wei Sheng Wen Jigang Wu Roman Wyrzykowski Yu Xiao Ramin Yahyapour Fang Yan Zheng Yan Laurence T. Yang Wun-She Yap

IRIT, France University of Illinois at Urbana-Champaign, USA The Australian National University, Australia The Ohio State University, USA University of Science and Technology of China, China Beijing Institute of Technology, China CNRS, France Beijing Institute of Technology, China Dongguan University of Technology, China CICESE Research Center, Mexico University of Pisa, Italy Nagoya Institute of Technology, Japan Koç University, Turkey Moscow University, Russia Wuhan University, China Shandong Normal University, China Nankai University, China Swinbourne University of Technology, China Guangdong University of Technology, China Czestochowa University of Technology, Poland Shandong University of Technology, China University of Göttingen, Germany Beijing Wuzi University, China Xidian University, China St. Francis Xavier University, Canada Universiti Tunku Abdul Rahman, Malaysia

Contents – Part III

Big Data and Information Processing

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation in Cooperative Cognitive Radio Networks . . . 3
Xinxiang Zhang, Jigang Wu, and Long Chen

QoS-Driven Service Matching Algorithm Based on User Requirements . . . 17
Mengying Guo and Xudong Yang

Research on Overload Classification Method for Bus Images Based on Image Processing and SVM . . . 28
Tingting Li, Yongxiong Sun, Yanhua Liang, Yujia Zhai, and Xuan Ji

Accurate Acoustic Based Gesture Classification with Zero Start-Up Cost . . . 44
Haojun Ai, Liangliang Han, Yifeng Wang, and Liang Liao

An Approach of Collecting Performance Anomaly Dataset for NFV Infrastructure . . . 59
Qingfeng Du, Yu He, Tiandi Xie, Kanglin Yin, and Juan Qiu

An Axiomatization for BSP Algorithms . . . 72
Yoann Marquer and Frédéric Gava

Efficient and Secure Outsourced Linear Regression . . . 89
Haomiao Yang, Weichao He, Qixian Zhou, and Hongwei Li

New Multi-objectives Scheduling Strategies in Docker SwarmKit . . . 103
Tarek Menouer, Christophe Cérin, and Étienne Leclercq

Internet Performance Prediction Framework Based on PingER Dataset . . . 118
Wei Zhang, Xiaofei Xing, Saqib Ali, and Guojun Wang

MS-RAID: An Energy-Saving Data Layout for CDP . . . 132
Jingyu Liu, Ziyao Zhang, Lu Liu, and Xin Chai

Incentivizing Multimedia Data Acquisition for Machine Learning System . . . 142
Yiren Gu, Hang Shen, Guangwei Bai, Tianjing Wang, Hai Tong, and Yujia Hu

Toward Performance Prediction for Multi-BSP Programs in ML . . . 159
Victor Allombert, Frédéric Gava, and Julien Tesson

Exploiting the Table of Energy and Power Leverages . . . 175
Issam Raïs, Laurent Lefèvre, Anne-Cécile Orgerie, and Anne Benoit

A Semantic Web Based Intelligent IoT Model . . . 186
Chao Qu, Ming Tao, Jie Zhang, Xiaoyu Hong, and Ruifen Yuan

Accelerating CNNs Using Optimized Scheduling Strategy . . . 196
Rui Xu, Sheng Ma, Wenwu Li, and Yang Guo

Data Analysis of Blended Learning in Python Programming . . . 209
Qian Chu, Xiaomei Yu, Yuli Jiang, and Hong Wang

APs Deployment Optimization for Indoor Fingerprint Positioning with Adaptive Particle Swarm Algorithm . . . 218
Jianhui Zhao, Jun Li, Haojun Ai, and Bo Cai

Deployment Optimization of Indoor Positioning Signal Sources with Fireworks Algorithm . . . 229
Jianhui Zhao, Shiqi Wen, Haojun Ai, and Bo Cai

A Study of Sleep Stages Threshold Based on Multiscale Fuzzy Entropy . . . 239
Xuexiao Shao, Bin Hu, Yalin Li, and Xiangwei Zheng

Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels . . . 249
Jing Liu, Kun Han, Wenhua Wu, Shu Wang, and Xiao Yu

Facial Shape and Expression Transfer via Non-rigid Image Deformation . . . 257
Huabing Zhou, Shiqiang Ren, Yong Zhou, Yuyu Kuang, Yanduo Zhang, Wei Zhang, Tao Lu, Hanwen Chen, and Deng Chen

P-Schedule: Erasure Coding Schedule Strategy in Big Data Storage System . . . 270
Chao Yin, Haitao Lv, Tongfang Li, Yan Liu, Xiaoping Qu, and Sihao Yuan

Answer Aggregation of Crowdsourcing Employing an Improved EM-Based Approach . . . 280
Ran Zhang, Lei Liu, Lizhen Cui, Wei He, and Hui Li

Internet of Things and Cloud Computing

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data Using Apache Spark in Cloud . . . 293
Cheng Yang, Weidong Bao, Xiaomin Zhu, Ji Wang, and Wenhua Xiao

Task Offloading in Edge-Clouds with Budget Constraint . . . 311
Lei He, Hongli Xu, Haibo Wang, Liusheng Huang, and Jingyi Ma

Motion Trajectory Sequence-Based Map Matching Assisted Indoor Autonomous Mobile Robot Positioning . . . 327
Wenping Yu, Jianzhong Zhang, Jingdong Xu, and Yuwei Xu

Towards the Independent Spanning Trees in the Line Graphs of Interconnection Networks . . . 342
Baolei Cheng, Jianxi Fan, Xiaoyan Li, Guijuan Wang, Jingya Zhou, and Yuejuan Han

POEM: Pricing Longer for Edge Computing in the Device Cloud . . . 355
Qiankun Yu, Jigang Wu, and Long Chen

Mobility Analysis and Response for Software-Defined Internet of Things . . . 370
Zhiyong Zhang, Rui Wang, Xiaojun Cai, and Zhiping Jia

DStore: A Distributed Cloud Storage System Based on Smart Contracts and Blockchain . . . 385
Jingting Xue, Chunxiang Xu, Yuan Zhang, and Lanhua Bai

Towards an Efficient and Real-Time Scheduling Platform for Mobile Charging Vehicles . . . 402
Qi Liu, Jinyang Li, Xiaoshan Sun, Junjie Wang, Yang Ning, Wei Zheng, Jian Li, and Hengchang Liu

SoProtector: Securing Native C/C++ Libraries for Mobile Applications . . . 417
Ning Zhang, Guangquan Xu, Guozhu Meng, and Xi Zheng

CloudPT: Performance Testing for Identifying and Detecting Bottlenecks in IaaS . . . 432
Ameen Alkasem, Hongwei Liu, and Decheng Zuo

Smart Grid Power Trading Based on Consortium Blockchain in Internet of Things . . . 453
Dong Zheng, Kaixin Deng, Yinghui Zhang, Jiangfan Zhao, Xiaokun Zheng, and Xinwei Ma

Energy-Efficient Offloading in Mobile Edge Computing with Edge-Cloud Collaboration . . . 460
Xin Long, Jigang Wu, and Long Chen

Quantitatively Investigating Multihop Localization Errors in Regular 2-D Sensor Networks . . . 476
Bing Jia, Baoqi Huang, Tao Zhou, and Wuyungerile Li

Optimizing WiFi AP Placement for Both Localization and Coverage . . . 489
Yu Tian, Baoqi Huang, Bing Jia, and Long Zhao

PLZMA: A Parallel Data Compression Method for Cloud Computing . . . 504
Xin Wang, Lin Gan, Jingheng Xu, Jinzhe Yang, Maocai Xia, Haohuan Fu, Xiaomeng Huang, and Guangwen Yang

A Caching-Based Parallel FP-Growth in Apache Spark . . . 519
Zhicheng Cai, Xingyu Zhu, Yuehui Zheng, Duan Liu, and Lei Xu

Contextual-Field Supported Iterative Representation for Face Hallucination . . . 534
Kangli Zeng, Tao Lu, Xiaolin Li, Yanduo Zhang, Li Peng, and Shenming Qu

A Cancelable Multi-Biometric Template Generation Algorithm Based on Bloom Filter . . . 547
Lin You and Xun Li

Streaming ETL in Polystore Era . . . 560
Nabila Berkani and Ladjel Bellatreche

Communication-Aware Prediction-Based Online Scheduling in High-Performance Real-Time Embedded Systems . . . 575
Baptiste Goupille-Lescar, Eric Lenormand, Nikos Parlavantzas, and Christine Morin

Predicting SDC Vulnerability of Instructions Based on Random Forests Algorithm . . . 593
LiPing Liu, LinLin Ci, and Wei Liu

Hybrid Cloud Architecture for Cross-Platform Interoperability in Smart Homes . . . 608
Ming Tao, Chao Qu, Wenhong Wei, Bin Zhou, and Shuqiang Huang

Conflict-Free Block-with-Stride Access of 2D Storage Structure . . . 618
Rui Song, Guozhao Zeng, Sheng Liu, and Haiyan Chen

Graph-Based Indoor Localization with the Fusion of PDR and RFID Technologies . . . 630
Jie Wu, Minghua Zhu, Bo Xiao, and Yunzhou Qiu

UAV 3D Mobility Model Oriented to Dynamic and Uncertain Environment . . . 640
Na Wang, Nan Di, Fei Dai, and Fangxin Liu

Acquiring Hidden Space via Modifying Block Bitmap for Android Devices . . . 651
Wang Lianfang, Huang Hong, Li Yuanzhang, and Zhang Li

Interest Relevance-Based Caching Design in Content-Centric Networking . . . 661
Guozhi Zhang, Jiqiang Liu, Xiaolin Chang, and Yang Yang

Author Index . . . 673

Big Data and Information Processing

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation in Cooperative Cognitive Radio Networks

Xinxiang Zhang, Jigang Wu(B), and Long Chen

Guangdong University of Technology, Guangzhou 510006, China
zxx [email protected], [email protected], [email protected]

Abstract. Cooperative cognitive radio networks have been proposed to address the spectrum starvation problem and enhance the transmission rate of mobile devices. Most existing works assume that one user can afford the whole spectrum and neglect users' selfish nature, which is not practical. Based on group buying, a two-stage auction mechanism named TAMSA is proposed to guarantee quality of service and improve the utilization of spectrum resources. TAMSA is an incentive mechanism involving the primary users (PUs) and relay nodes; it also reduces the cost of the secondary users (SUs) and increases the utilities of both PUs and relay nodes. In the first stage, SUs submit their budgets, valuations, and demands for spectrum resources to relay nodes as group buyers, and the relay nodes calculate revenues and determine the winning SUs. In the second stage, a VCG auction is executed between the relay nodes and PUs, using a maximum-weighted-matching algorithm. TAMSA effectively allocates spectrum resources to meet the demands of SUs. We show that TAMSA is truthful, individually rational, and computationally efficient. Extensive simulation results show that TAMSA outperforms a random algorithm by 256% in terms of the average utility of PUs, and improves the average utility of SUs and relay nodes significantly, by up to 213% and 10 times, respectively. TAMSA further improves the average utility of PUs by 28.33% and 78.65% over TASG and TACC, respectively.

Keywords: Spectrum allocation · VCG auction · Incentive mechanism · Cooperative cognitive radio networks

1 Introduction

With the explosive growth of smart phones, wearable devices, and the Internet of Things (IoT), demand for higher data rates and lower latency keeps increasing. Spectrum is one of the most valuable resources for wireless communication devices. However, much of the spectrum has already been allocated to licensed users. On one hand, the remaining unused spectrum has become scarce. On the other hand, some allocated spectrum, such as radio and TV channels, is not fully utilized, resulting in spectrum holes [1–3]. Cognitive radio has been proposed to solve these problems, guaranteeing Quality of Service (QoS) for mobile devices and improving the utilization of spectrum resources. To enhance the performance of cognitive radio networks (CRNs), cooperative cognitive radio networks (CCRNs) were proposed [4]. In CCRNs there are two kinds of users: the spectrum holders, i.e., the primary (licensed) users, denoted PUs, and the secondary (unlicensed) users, denoted SUs [5]. Mobile devices with cognitive capability can dynamically detect and utilize idle spectrum, and CCRNs allow SUs to access the licensed spectrum occupied by PUs to improve spectrum utilization [6,7], provided the SUs do not cause strong interference to the normal communication of the PUs. CCRNs thus improve the utilization of spectrum resources through spectrum reuse.

Auctions play an important role in spectrum resource allocation, and there has been extensive research on auction-based spectrum allocation [8–10]. Most prior works design single-seller, multi-buyer auctions with homogeneous channels. In [1] and [4], the authors design truthful auctions for trading homogeneous channels between a seller and multiple SUs. In [5], a distributed resource allocation algorithm is adopted in which direct or cooperative transmission can be selected, with multiple sellers and multiple buyers. Many studies assume that PUs are willing to share their idle spectrum; in reality, PUs are usually selfish, so it is necessary to provide incentives for them to participate. The Vickrey-Clarke-Groves (VCG) auction guarantees the truthfulness of the auction process, which provides a new approach to resource allocation and effectively guarantees the economic returns of the participants.

A McAfee-based auction mechanism that considers cooperative transmission through relay nodes and maximizes the benefit of PUs has been proposed, but it does not consider the revenues of the relay nodes [7]. In [11–13], the authors propose VCG-based auction mechanisms that maximize the utility of PUs and guarantee truthfulness; however, the objective of maximizing the number of winning PUs neglects the specific demands of SUs for spectrum resources. In recent years, double auctions [10] and combinatorial auctions [11] have also been considered for spectrum allocation, but most works neglect cooperative data transmission through relay nodes. Inspired by popular group-buying services on the Internet, the authors in [13] and [14] propose auction algorithms based on group buying, which encourage SUs to voluntarily form groups to acquire spectrum resources in spectrum auctions; group buying effectively reduces the payments of the SUs. However, in [12–14] the spectrum resources are distributed equally among the winning SUs. In [15], a multiple-input multiple-output method with cooperative communication is proposed for CRNs, which allows SUs to relay data for PUs in exchange for the opportunity to transmit their own data, but it places high requirements on hardware configuration. In this work, we reduce the payments of SUs through group buying, and we allocate spectrum resources according to the specific demands of the SUs.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 3–16, 2018. https://doi.org/10.1007/978-3-030-05057-3_1


To effectively allocate spectrum resources and encourage PUs to share spectrum in the proposed auction, we have to solve the following challenges. (1) Applications running on mobile devices are heterogeneous, so the budget and demand of each SU differ; moreover, reducing the cost of the SUs is itself a challenge. (2) Spectrum holders and relay nodes are selfish by nature, so incentives must be designed for both PUs and relay nodes. (3) The auction mechanism should be truthful, budget balanced, individually rational, and computationally efficient.

Different from previous works, we focus on an incentive auction mechanism for efficient spectrum resource allocation in CCRNs. TAMSA provides incentives for both PUs and relay nodes to participate in the auction. Moreover, TAMSA is based on group buying to reduce the payments of SUs, and it allocates spectrum resources according to the specific demands of SUs. The main contributions of this work are summarized as follows.

• To effectively reduce the payments of SUs, we propose an auction algorithm based on group buying that accounts for the specific demands for spectrum resources. The mechanism is applicable to heterogeneous networks, and its economic properties (truthfulness, budget balance, individual rationality, and computational efficiency) are proved.
• We design an incentive mechanism that encourages spectrum holders to share their idle spectrum resources and encourages relay nodes to transmit data cooperatively.
• Numerical results demonstrate that TAMSA is superior to the Random algorithm by 256% in terms of the average utility of PUs, and the average utility of relay nodes and SUs in TAMSA outperforms Random by 10 times and 213%, respectively. TAMSA further improves the average utility of PUs by 28.33% and 78.65% over TASG and TACC, respectively.

2 System Model and Problem Formulation

In this section, we present the system model, formulate the problem to be studied, and introduce the economic properties that an auction scheme should satisfy. The basic notations are shown in Table 1.

2.1 System Model

In this paper, we consider a cognitive radio network with multiple primary users and multiple secondary users. To improve the channel transmission rate, we also take relay nodes into account. In this scenario, as in [16], we assume all nodes stay static during a given auction period. The TAMSA scheme aims to maximize social welfare in a spectrum auction while encouraging both PUs and SUs to participate. To maximize the utilization of spectrum resources, the incentive mechanism should properly assign the matching between the spectrum

Table 1. Notations for the system model.

Notation     Meaning
PUs          Set of primary users
SUs          Set of secondary users
R_i          The ith relay node, where i ∈ [1, M]
S_i          The ith group, where i ∈ [1, n_i]
s_i^j        The jth secondary user in the ith group, 1 ≤ i ≤ M, 1 ≤ j ≤ n_i
d_i^j(k)     Demand of s_i^j for the kth channel (PU_k), 1 ≤ k ≤ M
b_i^j(k)     The bid of s_i^j for the kth channel
v_i^j(k)     The valuation of s_i^j for the kth channel
A_k          Ask (reserve) price of the kth channel
S_i^w        Set of winning secondary users, 1 ≤ w ≤ n_i
R_i^w        Set of winning relay nodes
PU_i^w       Set of winning primary users
p_i^j(k)     The payment of s_i^j for the kth channel
p_c(k)       The clearing price
F_i(k)       S_i(k)'s payment for the kth relay node
P_i(k)       The ith relay node R_i(k)'s payment to PU_k
B_i(k)       The bid of the ith relay node R_i(k) for PU_k
u_i^j        The utility of s_i^j
U_PUk        The utility of PU_k
U_Rk         The utility of R_k

resources and the demands of the SUs. Trading between PUs and SUs should meet certain requirements to benefit both parties: PUs need to be incentivized to provide resources, and the demands of SUs should be satisfied. The proposed network model is shown in Fig. 1; it is a hierarchical auction consisting of M PUs and n_i SUs. The PUs possess M heterogeneous channels, and each primary user has a reserve price A_k, where k ∈ [1, M], which is the lowest price at which PU_k is willing to sell the kth channel. The PUs have different reserve prices A_k for their spectrum, and we assume each relay node can buy at most one spectrum band. In the ith group S_i, where i ∈ [1, M], there are n SUs and S_i = {s_i^1, s_i^2, ..., s_i^n}, n ≤ n_i. Each s_i^j has a bid (budget) b_i^j(k) and a valuation v_i^j(k) for the kth channel PU_k. To improve the utilization of spectrum resources, each s_i^j also submits its demand for spectrum d_i^j(k) to PU_k. The spectrum resource is allocated according to the specific demands of the SUs.


Fig. 1. Auction model.

We design an incentive mechanism to improve the utilities of PUs and relay nodes. TAMSA is a two-stage hierarchical auction consisting of two single-round sealed-bid auctions, called the stage I auction and the stage II auction. In stage I, the auction runs between relay nodes and the groups of secondary users S_i; in stage II, the auction runs between PUs and relay nodes R_i, and the PUs sell their spectrum resources to the relay nodes. The relay node R_i(k) gathers the bids and demands from the ith group S_i, and then the system executes the stage II auction. R_i(k) submits the bid B_i(k) to PU_k, and PU_k gives its reserve price A_k, where k ∈ [1, M], noting that B_i(k) ≥ A_k. The relay node R_i(k) determines the winners in group S_i(k) after gathering the group members' bids; the set of winning SUs is denoted by S_i^w(k), where S_i^w(k) ⊆ S_i, and the gathered bid is F_i(k). We assume that each group pays at most one relay node at a time, because one relay node serving multiple groups might cause transmission delay. If it wins in this auction, relay node R_i allocates spectrum resources to S_i^w(k).

2.2 Problem Formulation

The system determines the payments of the winners. To achieve fairness, the payments of winners should be proportional to their demands. The payment of s_i^j(k) is formulated as

    p_i^j(k) = p_c(k) · d_i^j(k),   1 ≤ i ≤ M, 1 ≤ j ≤ n_i, 1 ≤ k ≤ M,    (1)


where p_c(k) is the clearing price. Let u_i^j denote the utility of secondary user s_i^j, for each s_i^j ∈ S_i^w. The utility of s_i^j is defined as

    u_i^j = { v_i^j(k) − p_i^j(k),  if s_i^j ∈ S_i^w and p_i^j(k) ≤ b_i^j(k)    (2)
            { 0,                    otherwise.

Note that the payment of s_i^j(k) must not exceed the budget b_i^j(k), for k ∈ [1, M] and s_i^j ∈ S_i^w. The relay node R_i(k) computes the amount F_i(k) collected from the SUs; hence the utility of relay node R_i is

    U_Ri = { F_i(k) − P_i(k),  if R_i(k) ∈ R_i^w    (3)
           { 0,                otherwise,

where P_i(k) is the payment of the relay node to the PU. To encourage spectrum holders to share their spectrum, each PU_k has a reserve price A_k, and the payment P_i(k) of the relay node should be higher than A_k, so the utility of PU_k is defined as

    U_PUk = { P_i(k) − A_k,  if PU_k ∈ PU_k^w and R_i ∈ R_i^w    (4)
            { 0,             otherwise.

In this auction, the spectrum owners PU_k allocate spectrum resources to the SUs, and the channel transmission rate is increased through the cooperation of the relay nodes.

2.3 Economic Properties

In this subsection, we present the economic properties that we would like to achieve. An auction should not be executed unless these properties are satisfied.

Definition 1 (Truthfulness). An auction is truthful if bidding the true valuation is a dominant strategy: every participant's utility is maximized by bidding its true valuation, and no bidder can improve its utility by misreporting. In our mechanism, each s_i^j submits its true valuation to R_i, and each relay node R_i reports its true valuation to the kth primary user PU_k.

Definition 2 (Budget Balance). An auction is budget balanced if the total payment from buyers is no less than the total revenue paid to sellers. In our mechanism, the stage I auction is conducted in groups; we ensure the utilities of the auctioneers are nonnegative, i.e., the payments the relay nodes receive from the groups are no less than the amounts they pay to the PUs.

Definition 3 (Individual Rationality). An auction is individually rational if the utility of each participant is nonnegative. In the TAMSA scheme, the utilities of the SUs, relay nodes R_i, and PUs, that is, u_i^j, U_Ri, and U_PUk, are nonnegative.

Definition 4 (Computational Efficiency). A mechanism is computationally efficient if it terminates in polynomial time. In our mechanism, the selection of winning SUs, the matching of PUs and relay nodes, and the computation of the clearing price and payments are all completed in polynomial time.
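The utility definitions in Eqs. (2)-(4) translate directly into code. The following is a minimal illustrative sketch; the function and parameter names are ours, not from the paper:

```python
def su_utility(v, p, b, is_winner):
    """Eq. (2): a winning SU gains its valuation v minus its payment p,
    provided the payment does not exceed its budget b; otherwise 0."""
    return v - p if is_winner and p <= b else 0.0

def relay_utility(F, P, is_winner):
    """Eq. (3): a winning relay node collects F from its group
    and pays P to the primary user."""
    return F - P if is_winner else 0.0

def pu_utility(P, A, is_winner):
    """Eq. (4): a winning PU receives the relay's payment P
    above its reserve price A."""
    return P - A if is_winner else 0.0
```

For instance, a winning SU with valuation 10, payment 6, and budget 7 obtains utility 4, while a payment above the budget yields utility 0.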

3 Two-Stage Auction Mechanism

In this section, we propose a truthful two-stage auction framework called TAMSA for cognitive radio networks, shown in Fig. 1. TAMSA consists of two sub-auctions and satisfies truthfulness, budget balance, individual rationality, and computational efficiency.

3.1 Stage I Auction

In this stage, the n_i secondary users are randomly divided into multiple groups. The groups submit their bids (budgets) to relay nodes separately. The relay nodes conduct the auction and tentatively decide the winning group members; then they calculate the payment of each winner and determine the final winners. A relay node allocates channels to its SUs if it obtains spectrum resources in the stage II auction. We first introduce the algorithm that buys the spectrum by group and decides the winners (GBDW); the details are as follows. First, relay node R_i collects the bid vector (b_i^1, b_i^2, ..., b_i^{n_i}), demand vector (d_i^1, d_i^2, ..., d_i^{n_i}), and valuation vector (v_i^1, v_i^2, ..., v_i^{n_i}) from the SUs in S_i, as previously mentioned. We design an algorithm to calculate the budget F_i(k) for PU_k. Then, the relay nodes decide the winners according to the best performance ratio and calculate the optimal unit price for each group. The relay node R_i sells at most a 1/2 time fraction to S_i to maximize its revenue. Inspired by the work in [16], we sort the vector b/d in descending order; the optimal unit price for group S_i, denoted OPT(b/d), is then

    OPT(b/d) = max_{1 ≤ i ≤ |b|} i · (b_i / d_i),    (5)

where |b| denotes the length of the vector, and b_i and d_i denote the ith budget and demand, respectively. The details are shown in Algorithm 1. Note that the clearing price is extracted from the group to ensure truthfulness. The relay node selects the maximum integer m via OPT(b/d), and then eliminates the SUs whose payments would exceed their budgets or valuations. F_i(k) is the bid gathered from the winning SUs, and PU_k charges R_i(k) less than F_i(k) for trading the kth channel.

The following example shows how Algorithm 1 calculates the clearing price and determines the winners. Assume there are 5 SUs in group i, with budget and demand vectors b = {2, 3, 7, 6, 8} and d = {1, 2, 3, 2.5, 4}, so b/d = {2, 1.5, 2.33, 2.4, 2}. We sort b/d in descending order and calculate OPT(b/d) to obtain the maximum m; hence m = 4 and the clearing price is p_c = 8/4 = 2. The payment of s_i^1 to the ith relay node is p_i^1 = p_c × d_i^1 = 2 × 1 = 2. In the same way, the payments of the other four secondary users are 4, 6, 5, and 8, respectively. Since the payment 4 exceeds the budget 3 of s_i^2, the winners in the ith group are s_i^1, s_i^3, s_i^4, and s_i^5, and the amount collected by the ith relay node is 21.
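The worked example above can be reproduced with a short sketch of the GBDW procedure. This is a hypothetical Python rendering, not the authors' code: it uses the non-strict comparisons p ≤ b and p ≤ v that the example implies (s_i^5 pays exactly its budget and still wins), it derives p_c as the unit price of the m-th bidder in sorted order (which coincides with Algorithm 1's p_c = b_i^m/m in this example), and the valuations are set high enough not to bind.

```python
def gbdw(b, d, v):
    """Group Buying and Decide Winners (sketch of Algorithm 1).
    b, d, v: budgets, demands, and valuations of the SUs in one group."""
    n = len(b)
    # sort indices by unit price b/d in descending order
    order = sorted(range(n), key=lambda i: b[i] / d[i], reverse=True)
    # pick m maximizing m times the m-th largest unit price, as in Eq. (5)
    best_m = max(range(1, n + 1),
                 key=lambda m: m * b[order[m - 1]] / d[order[m - 1]])
    i = order[best_m - 1]
    pc = b[i] / d[i]  # clearing price per unit of demand
    winners, payments, revenue = [], {}, 0.0
    for j in range(n):
        p = pc * d[j]
        if p <= b[j] and p <= v[j]:  # within budget and valuation
            winners.append(j)
            payments[j] = p
            revenue += p
    return pc, winners, payments, revenue

# the example from the text: b = {2, 3, 7, 6, 8}, d = {1, 2, 3, 2.5, 4}
pc, winners, payments, revenue = gbdw(
    [2, 3, 7, 6, 8], [1, 2, 3, 2.5, 4], [100] * 5)
```

Running this reproduces the text: p_c = 2, s_i^2 is eliminated (its payment 4 exceeds its budget 3), and the relay node collects 21 in total.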


Algorithm 1. GBDW: Group Buying and Decide Winners
Input: Sorted vector of b/d and the valuations.
Output: The revenue of the relay node, S_i^w, and the payments of the secondary users.
1: Let 1 ≤ m ≤ n_i − 1 be a bid-independent integer.
2: Find the m that maximizes OPT(b/d) over the sorted vector b/d.
3: p_c ← b_i^m / m
4: S_i^w(k) ← ∅
5: F_i(k) ← 0
6: for j ← 1 to n_i do
7:     p_i^j(k) ← p_c · d_i^j(k)
8:     if p_i^j(k) ≤ b_i^j(k) and p_i^j(k) ≤ v_i^j(k) then
9:         S_i^w(k) ← S_i^w(k) ∪ {s_i^j(k)}
10:        F_i(k) ← F_i(k) + p_i^j(k)
11:    end if
12: end for
13: return F_i(k), S_i^w(k)

3.2 Stage II Auction

In this stage, the auction is conducted between PUs and relay nodes, with the relay nodes competing for the idle spectrum resources of the PUs. According to previous research, the McAfee auction mechanism cannot be utilized here, since it only suits scenarios with homogeneous goods [17]. To ensure the truthfulness of the auction mechanism and support heterogeneous networks, we design a spectrum resource allocation algorithm, SRA, based on the VCG auction mechanism; the details are shown in Algorithm 2. We apply the VCG-based mechanism to maximize social welfare, that is, the total utility of all participating bidders. A relay node assigns spectrum resources to S_i^w when it wins a primary user, and relay node R_i pays the winning PU_k a reward P_i, calculated by algorithm SRA. We use the bids of the relay nodes B_i(k) and the reserve prices A_k to construct a weighted complete bipartite graph with edge weights B_i(k) − A_k. Maximum-Weighted-Matching (MWM) optimizes the total utility of all participants in this auction, and to ensure truthfulness we apply VCG-based payments for the relay nodes. The details are as follows.

4 Theoretical Analysis

In this section, we prove that TAMSA satisfies truthfulness, individual rationality, budget balance, and computational efficiency.

Theorem 1. TAMSA is truthful in the network.

Proof. In the following, we focus on proving the dominant strategy for SUs. A buyer s_i^j(k) ∈ S_i submits its true bid and demand, because they reflect its true demand for spectrum resources.


Algorithm 2. SRA: Spectrum Resource Allocation
Input: B_i(k), A_k, for all 1 ≤ i ≤ n_i and 1 ≤ k ≤ M.
Output: R^w, PU^w, P_i.
1: W ← ∅, E* ← ∅, P_i ← ∅   // W is the edge set in the matching graph
2: Create a weighted complete bipartite graph G = (R, PU, W, w) with weight w(R_i, PU_k) = B_i(k) − A_k if B_i(k) ≥ A_k.
3: E* ← Maximum-Weighted-Matching(W)
4: for each (R_i, PU_k) ∈ E* do
5:     R^w ← R^w ∪ {R_i}, PU^w ← PU^w ∪ {PU_k}
6:     W' ← W \ (R_i, PU_k), R' ← R \ {R_i}
7:     G_{−i} ← (R', PU, W', w)
8:     E*_{−i} ← Maximum-Weighted-Matching(W')
9:     P_i ← w(E*_{−i}) − (w(E*) − w(R_i, PU_k)) + A_k
10: end for
11: return R^w, PU^w, P_i
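Algorithm 2 can be sketched in a few lines. The following is a hypothetical Python rendering, not the authors' implementation: the maximum-weighted-matching step is enumerated by brute force, which is only practical for the small instances considered here (5 PUs and 5 relay nodes in the simulations); an O(n^3) Hungarian-style algorithm would be used in practice.

```python
from itertools import combinations, permutations

def max_weight_matching(w):
    """Brute-force MWM on a bipartite graph; w[i][k] = B_i(k) - A_k,
    or None when B_i(k) < A_k (no edge)."""
    n_r = len(w)
    n_p = len(w[0]) if w else 0
    size = min(n_r, n_p)
    best_val, best = 0.0, []
    for rows in combinations(range(n_r), size):
        for cols in permutations(range(n_p), size):
            pairs = [(i, k) for i, k in zip(rows, cols)
                     if w[i][k] is not None]
            val = sum(w[i][k] for i, k in pairs)
            if val > best_val:
                best_val, best = val, pairs
    return best_val, best

def sra(B, A):
    """SRA sketch: VCG payment P_i = w(E*_{-i}) - (w(E*) - w(R_i, PU_k)) + A_k."""
    M = len(A)
    w = [[B[i][k] - A[k] if B[i][k] >= A[k] else None for k in range(M)]
         for i in range(len(B))]
    total, match = max_weight_matching(w)
    payments = {}
    for (i, k) in match:
        w_minus = [row for j, row in enumerate(w) if j != i]  # remove R_i
        total_minus, _ = max_weight_matching(w_minus)
        payments[i] = total_minus - (total - w[i][k]) + A[k]
    return match, payments
```

For example, with (made-up) bids B = [[30, 25], [28, 40], [35, 20]] for three relay nodes over two channels and reserve prices A = [10, 15], the matching assigns R_1 to PU_1 and R_2 to PU_0, with VCG payments 25 and 30; each payment lies between the PU's reserve price and the relay's bid, consistent with Theorems 1 and 2.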

For s_i^j ∈ S_i, it cannot improve its utility by changing its valuation or budget, by the first branch of Eq. (2). Moreover, inspired by [18], the clearing price p_c is generated from the optimal price ratio independently of any single bid. For R_i ∈ R^w, it obtains the maximum utility max(B_i(k) − A_k) if it wins the spectrum resource in this auction. If R_i ∉ R^w, it fails in this auction and cannot get the spectrum resource, because B_i(k) − A_k < 0. If relay node R_i submits an untruthful bid with B_i(k) < F_i(k), the outcome does not change. If B_i(k) > F_i(k) and the untruthful bid wins, the utility of the relay node is U_Ri(k) = F_i(k) − P_i = F_i(k) − B_i(k) ≤ 0. Therefore, neither relay nodes nor SUs can improve their utility by submitting untruthful bids.

Theorem 2. TAMSA is individually rational and budget balanced.

Proof. For SUs, the utility of s_i^j(k) is v_i^j(k) − p_i^j(k) > 0 for all s_i^j ∈ S_i^w, which proves the individual rationality of the SUs. We next show that relay nodes are also individually rational. For relay node R_i ∈ R_i^w, its payment satisfies A_k ≤ P_i ≤ B_i(k) ≤ F_i(k), with PU_k ∈ PU^w, so U_Ri = F_i(k) − P_i ≥ 0. For each winning primary user, the payment satisfies P_i ≥ A_k, so U_PUk = P_i − A_k ≥ 0. Therefore, both buyers and sellers are willing to participate in the auction: all participants gain nonnegative utility, and the TAMSA mechanism is individually rational and budget balanced.

Theorem 3. TAMSA is computationally efficient.

Proof. We analyze the time complexity of TAMSA. In Algorithm 1, the sorting step takes O(n_i log n_i) time. In Algorithm 2, the maximum-weighted-matching step takes O(max{n_i, M}^3) time, and computing the payments takes O(n_i max{n_i, M}^3). Hence, TAMSA is computationally efficient.


5 Numerical Results

In this section, we evaluate the performance of TAMSA. In the heterogeneous network structure we designed, this is the first incentive scheme proposed for the specific demands of secondary users, so there are no existing auction schemes to compare against directly. Instead, we design an upper bound (Upper) and a random algorithm (Random) for TAMSA to compare with. Meanwhile, we also simulate the algorithms TASG and TACC for comparison. The algorithm Upper uses the bids of buyers as the payment to maximize the revenue. In TASG and TACC, secondary users are divided randomly into two sets, and the winning set of one side is selected using the other side. TASG is based on the VCG mechanism, and TACC sorts the reserve prices Ak of primary users in ascending order and the budgets Bi(k) of relay nodes in descending order. The experiment tool is MATLAB, and the results are averaged over 100 repetitions. We consider the heterogeneous network shown in Fig. 1. We assume that the number of PUs is M = 5, that 5 relay nodes participate in the auction, and that the number of SUs ni varies from 20 to 120 with an increment of 20. We assume that the valuations vij(k) and budgets bji(k) of secondary users are uniformly distributed, with ranges denoted U(50, 150) and U(5, 10) respectively. The reserve price Ak follows U(10, 20), following [15–18].

5.1 Simulation Results
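The simulation setup described above can be reproduced as a small sketch. The helper name is ours and the paper's experiments use MATLAB; only the stated parameters (M = 5 PUs, 5 relays, ni ∈ {20, ..., 120}, v ~ U(50, 150), b ~ U(5, 10), Ak ~ U(10, 20)) are taken from the text.

```python
# Hypothetical generator for one TAMSA simulation instance.
import random

def generate_instance(n_sus, n_pus=5, n_relays=5, seed=0):
    """Draw SU valuations/budgets and PU reserve prices from the
    uniform distributions stated in the experimental setup."""
    rng = random.Random(seed)
    valuations = [rng.uniform(50, 150) for _ in range(n_sus)]  # v_i^j(k)
    budgets    = [rng.uniform(5, 10)   for _ in range(n_sus)]  # b_i^j(k)
    reserves   = [rng.uniform(10, 20)  for _ in range(n_pus)]  # A_k
    return valuations, budgets, reserves

v, b, A = generate_instance(n_sus=20)
assert len(v) == 20 and all(50 <= x <= 150 for x in v)
assert all(10 <= a <= 20 for a in A)
```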

We first investigate the running time of TAMSA; the results are shown in Figs. 2 and 3. From Fig. 2, we can see that the running time is no more than 0.35 s even when the number of SUs becomes large, i.e., when there are 120 SUs. From Fig. 3, we can see that the algorithm Random runs fastest, since it selects the winning secondary users Siw randomly.

Fig. 2. Running time of TAMSA (a).

Fig. 3. Running time of TAMSA (b).

TAMSA: Two-Stage Auction Mechanism for Spectrum Allocation


In the TACC auction mechanism, the reserve prices Ak of the primary users are sorted in ascending order and the budgets Bi(k) of the relay nodes in descending order to guarantee the utility of the PUs. Besides, TACC needs to match every primary user and relay node, so this algorithm runs the slowest. The running times of TAMSA and TASG are not large, because they use the maximum-weighted-matching algorithm to complete the matching between the winning PUs and relay nodes.

Next, we show the truthfulness of TAMSA in Fig. 4, which also reflects the individual rationality and budget balance established in Theorem 2. In this auction, the payment of relay nodes Pi(k) is not higher than the amount Fi(k) collected from the SUs, and each winning primary user PUiw receives a payment not less than its reserve price Ak from the auctioneer. From the experimental results in Fig. 4, we can see that the utility remains nonnegative when relay nodes submit truthful bids, but when a relay node submits an untruthful bid, its utility drops rapidly and remains negative. Figure 4 depicts this difference in utility: relay nodes bid truthfully when the bid is less than 50, and when the bid is greater than 50, the gap between truthful and untruthful bids appears. The utilities of relay nodes and PUs are nonnegative under truthful bidding, because the bid of the relay node is then less than the amount collected from the SUs and greater than the reserve price of the PU, i.e., Bi(k) ≤ Fi(k) and Bi(k) ≥ Ak. The utility of a relay node is negative, that is when Bi(k) > Fi(k), if it submits an untruthful bid. In summary, as seen in Fig. 4, the utility of relay nodes cannot be improved by submitting untruthful bids.

Fig. 4. Truthfulness of TAMSA.

Fig. 5. Average utility of PUs with the number of SUs.

Figure 5 shows how the utility of primary users UPUk varies with the number of SUs. As the number of SUs increases, the average utility of PUs under the five algorithms gradually increases. On average, the proposed algorithm TAMSA improves the utility of PUs by 256% compared with the algorithm Random; TASG is about 217% better than Random, and TACC achieves about 156% utility gain over Random on UPUk. TAMSA is further improved by 28.33% and 78.65% over TASG and TACC in terms of average utility of PUs, respectively. That is because both TAMSA and TASG apply the maximum-weighted-matching algorithm to match PUs and relay nodes, ensuring the maximum benefit, and both use a VCG-based auction mechanism to ensure truthfulness. The difference between them is that TAMSA selects the winning set of SUs by optimal cost performance, while TASG selects the winning set of one subset of SUs using the bids of another subset; the optimal cost performance enhances the revenue of PUs. In TACC, although the utility of PUs can be increased, the maximization of their earnings cannot be guaranteed.

Fig. 6. Average utility of relay nodes with the number of SUs.

Fig. 7. Average utility of SUs with the number of SUs.

Figure 6 depicts the average utility of relay nodes under a varying number of SUs. We can see that TAMSA outperforms Random by about 10 times on average, while TASG and TACC are about 7 times and 6.6 times better than Random, respectively. TAMSA is further improved by 44.59% and 64.22% over TASG and TACC in terms of average utility of relay nodes, respectively. The reason is that both TAMSA and TASG use the VCG auction mechanism to calculate the payment of relay nodes Pi(k). In Algorithm 2, the payment of a relay node is effectively reduced on the premise of guaranteeing the primary user's revenue, so the utility of relay nodes is improved.

Figure 7 shows the relationship between the average utility of SUs and the number of SUs. The average utility of SUs in TAMSA outperforms Random by 213%, TASG improves it over Random by up to 181%, and TACC achieves about 115% utility gain over Random. TAMSA is improved by 16.99% and 85.73% over TASG and TACC in terms of average utility of SUs, respectively. That is because TAMSA selects the winning set Siw by optimal cost performance, and the payment of SUs is calculated according to their specific demands, so TAMSA effectively improves the utility of SUs. The algorithms TASG and TACC calculate the payment of one subset of SUs using the bids of another subset: TASG adopts the optimal single-price auction to reduce the payment of SUs, while in TACC the payment of SUs is the average value of the winning SUs.

From the above experiments, we can see that TAMSA is suitable for the heterogeneous network, where the utilities of all participants can be maximized at the same time. TAMSA gains higher social welfare than the algorithms Random, TASG and TACC. Hence, TAMSA can be deployed in real scenarios and can effectively improve the utilization of spectrum resources.

6 Conclusion

In this paper, we have proposed a two-stage truthful auction mechanism for spectrum allocation (TAMSA) in cognitive radio networks with multiple primary users, multiple secondary users, and relay nodes. We have designed an incentive mechanism that encourages spectrum holders to share their idle spectrum resources and encourages cooperative transmission of data, improving the utilization of spectrum resources. TAMSA is a two-stage auction mechanism: in the first stage, SUs submit budgets and valuations for spectrum resources to relay nodes, which calculate the payments of the SUs and determine the winning set Siw; in the second stage, relay nodes submit bids to PUs to compete for spectrum resources. We have proved that TAMSA satisfies truthfulness, individual rationality, and computational efficiency. Extensive simulation results show that TAMSA outperforms a random algorithm by 256% in terms of average utility of PUs, and improves the average utility of SUs and relay nodes significantly, by up to 213% and about 10 times, respectively. The performance of TAMSA is further improved by 28.33% and 78.65% in terms of average utility of PUs over TASG and TACC, respectively. The numerical results validate our theoretical analysis and demonstrate the improved efficiency of the auction mechanism.

Acknowledgment. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702115 and 61672171, the Natural Science Foundation of Guangdong, China under Grant No. 2018B030311007, and the Major R&D Project of the Educational Commission of Guangdong under Grant No. 2016KZDXM052. This work was also supported by the China Postdoctoral Science Foundation under Grant No. 2017M622632.

References

1. Zheng, Z., Wu, F., Tang, S., et al.: AEGIS: an unknown combinatorial auction mechanism framework for heterogeneous spectrum redistribution in noncooperative wireless networks. IEEE/ACM Trans. Netw. 24(3), 1919–1932 (2016)
2. Zhu, Y., Li, B., Li, Z., et al.: Truthful spectrum auction design for secondary networks. In: INFOCOM, pp. 873–881. IEEE, Orlando, FL, USA (2012)
3. Chen, L., Huang, L., Xu, H., et al.: Optimal channel allocation for multi-PU and multi-SU pairs in underlay cognitive radio networks. Int. J. Ad Hoc Ubiquitous Comput. 27(1), 19–33 (2018)
4. Wang, X., Huang, L., Xu, H., et al.: Truthful auction for resource allocation in cooperative cognitive radio networks. In: 24th International Conference on Computer Communication and Networks, pp. 1–8. IEEE, Las Vegas, NV, USA (2015)
5. Wang, X., Huang, L., Xu, H., et al.: Social welfare maximization auction for secondary spectrum markets: a long-term perspective. In: 13th IEEE International Conference on Sensing, Communication, and Networking, pp. 1–9. IEEE, London, UK (2016)
6. Shen, F., Li, D., Lin, P.H., et al.: Auction based spectrum sharing for hybrid access in macro-femtocell networks under QoS requirements. In: IEEE International Conference on Communications, pp. 3335–3340. IEEE, London, UK (2015)
7. Wang, H., Liu, Z., Cheng, Z., et al.: Maximization of link capacity by joint power and spectrum allocation for smart satellite transponder. In: 23rd Asia-Pacific Conference on Communications, pp. 1–6. IEEE, Perth, WA, Australia (2017)
8. Jia, J., Zhang, Q., Zhang, Q., et al.: Revenue generation for truthful spectrum auction in dynamic spectrum access. In: 10th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 3–12. ACM, New Orleans, LA, USA (2009)
9. Liu, Y., Tao, M., Huang, J.: An auction approach to distributed power allocation for multiuser cooperative networks. IEEE Trans. Wirel. Commun. 12(1), 237–247 (2012)
10. Shi, W., Zhang, L., Wu, C., et al.: An online auction framework for dynamic resource provisioning in cloud computing. IEEE/ACM Trans. Netw. 24(4), 2060–2073 (2016)
11. Feng, Z., Zhu, Y., Zhang, Q., et al.: TRAC: truthful auction for location-aware collaborative sensing in mobile crowdsourcing. In: INFOCOM, pp. 1231–1239. IEEE, Toronto, ON, Canada (2014)
12. Wu, F., Vaidya, N.: A strategy-proof radio spectrum auction mechanism in noncooperative wireless networks. IEEE Trans. Mob. Comput. 12(5), 885–894 (2013)
13. Lee, C., Wang, P., Niyato, D.: A real-time group auction system for efficient allocation of cloud internet applications. IEEE Trans. Serv. Comput. 8(2), 251–268 (2015)
14. Lin, P., et al.: Groupon in the air: a three-stage auction framework for spectrum group-buying. In: INFOCOM, pp. 2013–2021. IEEE, Turin, Italy (2013)
15. Advaita, A., Gali, M.M., Chu, T.M.C., et al.: Outage probability of MIMO cognitive cooperative radio networks with multiple AF relays using orthogonal space-time block codes. In: Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 84–89. IEEE, Rome, Italy (2017)
16. Yang, D., Xue, G., Zhang, X.: Group buying spectrum auctions in cognitive radio networks. IEEE Trans. Veh. Technol. 66(1), 810–817 (2017)
17. Yang, D., Fang, X., Xue, G.: Truthful auction for cooperative communications. In: IEEE International Conference on Communications, pp. 1–10. IEEE, Ottawa, ON, Canada (2011)
18. Chen, L., Wu, J., Zhang, X.X., et al.: TARCO: two-stage auction for D2D relay aided computation resource allocation in HetNet. IEEE Trans. Serv. Comput. PP(99), 1 (2017)

QoS-Driven Service Matching Algorithm Based on User Requirements

Mengying Guo(B) and Xudong Yang

School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
{mengying 1204,xdyang}@bupt.edu.cn

Abstract. Quality of Service (QoS) is an important factor that should be considered in service matching. There are two problems in most existing solutions. Firstly, most QoS models are static models described by determinate values or probability distributions, ignoring the impact of the time factor; however, most QoS attributes, such as response time and reliability, are time-dependent. Secondly, the service selection criteria of most QoS-driven service matching algorithms are based on service performance, while user requirements and the load of services are not considered. In this paper, we propose a Time-Segmented QoS Model (TSQM) to model QoS dynamically. Based on this model, a Service Matching algorithm based on user QoS request and Priority (QPSM) is proposed, in which the priority of user requests is used to control the load of the services. Simulation results show that the algorithm achieves a higher response rate and a better load-balancing effect.

Keywords: Service matching · QoS · Dynamic QoS model · Service model · Load balancing

1 Introduction

SOA (Service-Oriented Architecture) has made it possible for IoT (Internet of Things) systems to build distributed applications from loosely coupled services [1]. In this way, IoT services can be provided to different systems as web services. With the number of IoT services increasing rapidly, selecting services among the numerous registered ones has become difficult [2]. The characteristics of IoT services dictate that service function and service quality must be taken into account simultaneously when performing service matching. QoS (Quality of Service), measured by different criteria such as delay, response time, reliability, availability, cost, etc. [3], has become a crucial factor in selecting among numerous services with the same function. The results of service matching depend not only on the matching degree to user requirements but also on the QoS attributes of the service itself. QoS-aware service selection is a complex multi-criterion decision problem, which is NP-hard, and it is still a challenging research topic [4].

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 17–27, 2018. https://doi.org/10.1007/978-3-030-05057-3_2

M. Guo and X. Yang

There have been many reasonable selection models and effective matching algorithms for QoS-aware service selection. In these models and algorithms, service matching is treated as an optimization problem over service selection, and the objective is to find the best service. However, the fact that the actual requirements of users are not considered is unacceptable for some users, because the matched services may have the best overall performance yet fail to satisfy the user requirement for a certain QoS attribute. Another problem with these models is that the QoS attributes are represented only by a single-valued model or a probabilistic model, and the influence of time is not taken into account. Because service QoS attributes change dynamically with time and user load, a static model cannot accurately represent the QoS values and thus seriously affects the accuracy of the matching results.

In this paper, by segmenting time and dynamically modeling each time period, we propose a Time-Segmented QoS Model (TSQM) which can represent QoS attributes more accurately. Based on this model, a Service Matching algorithm based on user QoS request and Priority (QPSM) is proposed. In this algorithm, the single QoS performance and the comprehensive QoS performance provided by services are considered simultaneously, and the load of each service is controlled according to priority, so that the user load is balanced across services.

The rest of the paper is organized as follows. Section 2 introduces related work on service matching technology. Sections 3 and 4 detail the TSQM model and the QPSM algorithm. Section 5 shows the simulation results to prove the feasibility and effectiveness of the QPSM algorithm, and the final section concludes this paper.

2 Related Work

QoS-based service matching can usually be divided into two relatively independent processes: service selection and service ranking [5]. Service selection ensures the most basic functional and QoS requirements of users or systems; service ranking is a further optimization on this basis. Models and algorithms for service selection can be divided into service-function-based and service-quality-based selection according to the selection criteria. In service-function-based models, concepts such as semantics or ontology are used to build service models [6,7]. Service-quality-based selection can be divided into single-QoS-performance and comprehensive-QoS-performance selection models [5], and also into single-value models and probability-based selection models [8–10].

Service function is one of the requirements that should be satisfied in the process of service matching. The fundamental purpose of service matching is to select the most appropriate service for the user based on the user's service request. More and more models describe and define services based on the semantic web and ontologies to understand the functional requirements of users more intelligently. A new resource model describing IoT resources in multiple dimensions was proposed in [6]; based on this model, a resource matching algorithm that selects suitable resources according to semantic similarity was also proposed. In [7] the authors proposed a QoS-based dynamic service composition method for the semantic IoT. According to a context-added QoS ontology, after dynamic semantic annotation of the services in the semantic Internet of Things, candidate service sets are dynamically selected and combined to provide more accurate services.

Service quality is another requirement that should be satisfied in the process of service matching. The QoS attributes of services significantly impact the comprehensive evaluation of services; therefore, QoS-based service selection is a viable scheme for IoT service selection. In most studies, such as [8,9], single-valued or probabilistic models are used to model each dimension of QoS, and the optimal services are selected by comparing service performance. In the process of QoS-aware service matching, not only the overall performance of the service but also each user QoS requirement should be considered. In [10] the authors proposed a service discovery algorithm based on multi-stage service matching, in which each QoS attribute is assigned a different weight, the QoS constraints are determined according to user requests, and finally the most suitable service is selected.

The QoS of a web service changes dynamically with factors such as network conditions, user load and time. A static model constructed solely from historical data cannot accurately represent these dynamic changes. Therefore, the time factor must be considered when modeling.

3 Service Model

In a complete process of service matching, both the function and the quality of a service should be taken into consideration. Assume that the virtual service set S is known and that all services in S satisfy the functional requirements requested by the user. Next, QoS modeling and service matching are discussed further.

3.1 Time-Segmented QoS Model Definition

The TSQM model is a time-segmented QoS model. According to the changes of QoS attributes over time, the QoS change period is divided into time periods with different intervals, and a QoS model is constructed separately for each time period.

Definition. The TSQM model of a service can be represented as a triple (ET, P, QM), where

• ET = [T0, T0 + T) is the effective period of the QoS, T0 is the start time of the effective period, and T is the period with which the QoS attributes are updated.
• P = {P1, P2, ..., PN} is the set of time periods of ET, with Pi = [ti, ti+1) and ∪i Pi = ET.
• QM = (Q1, Q2, ..., QN) is a sequence of QoS models, where Qi = (fDELAYi, fRESTi, fRELi, fUSAi, fCOSTi) is the QoS vector of the time period Pi, and fDELAYi, fRESTi, fRELi, fUSAi, fCOSTi are the probability distribution functions of delay, response time, reliability, availability, and cost.

Given a service, its QoS model at time t can be represented as Q(t) = (fDELAYt, fRESTt, fRELt, fUSAt, fCOSTt), where t ∈ [ti + kT, ti+1 + kT), k = 0, 1, ...

The TSQM model expresses that the QoS of a service changes with time. The model can be flexibly extended according to different user requirements, and the number of QoS attributes in each time period can be one or more. In this paper, delay, response time, reliability, availability and cost are selected as the QoS attributes.

3.2 Detailed Description of the Model
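The time-period lookup in the definition above can be sketched as a small class. The class and field names are illustrative, not from the paper; the QoS vectors here are placeholder tuples rather than probability distribution functions.

```python
# Minimal sketch of the TSQM triple (ET, P, QM): periods P_i = [t_i, t_{i+1})
# cover ET = [T0, T0 + T), one QoS vector per period, and Q(t) is the vector
# of the period containing t wrapped into the effective period.

class TSQM:
    def __init__(self, t0, period, boundaries, qos_vectors):
        # boundaries: [t_0, t_1, ..., t_N] with t_0 = T0 and t_N = T0 + T
        assert len(qos_vectors) == len(boundaries) - 1
        self.t0, self.period = t0, period
        self.boundaries, self.qos_vectors = boundaries, qos_vectors

    def q(self, t):
        """Return Q(t), handling t in [t_i + kT, t_{i+1} + kT) for any k."""
        t = self.t0 + (t - self.t0) % self.period
        for i in range(len(self.qos_vectors)):
            if self.boundaries[i] <= t < self.boundaries[i + 1]:
                return self.qos_vectors[i]
        raise ValueError("t outside effective period")

# Two periods over ET = [0, 24): night [0, 8) and day [8, 24)
model = TSQM(t0=0, period=24, boundaries=[0, 8, 24],
             qos_vectors=[("low-delay",), ("high-delay",)])
assert model.q(3) == ("low-delay",)
assert model.q(10 + 24) == ("high-delay",)   # k = 1 repetition of the period
```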

QoS Model. A QoS model of a service contains k QoS attributes. These attributes can be the five non-functional attributes defined in the TSQM model, and they can also be extended according to user requirements. The QoS of service Si corresponds to a set of QoS vectors consisting of a probability distribution function for each time period. To compare QoS performance more easily, the probability distribution function in each time period is converted into a determined value using the 99.9% criterion (choose a value that 99.9% of the data satisfies as the QoS value of the current time period), i.e., fQoSi → qi. For clarity, the QoS attributes mentioned below default to the QoS attributes within a certain time period.

The QoS attributes of service Si can be represented as a vector Qi = (qi1, qi2, ..., qik), where qik is the value converted from the probability distribution function of the k-th QoS attribute. We assume that the virtual service set consists of n candidate services, S = {S1, S2, ..., Sn}, and their QoS attributes can be represented as an n × k matrix:

    M = ⎡ q11  q12  ···  q1k ⎤
        ⎢ q21  q22  ···  q2k ⎥        (1)
        ⎢  ⋮    ⋮    ⋱    ⋮  ⎥
        ⎣ qn1  qn2  ···  qnk ⎦

Because of the differences in the ranges of QoS values and in their effect on the comprehensive service performance, the QoS values should be normalized by min-max normalization [11]. According to the impact on the comprehensive performance of the service, QoS attributes can be classified into positive-effect attributes and negative-effect attributes: the larger the value of a positive-effect attribute (such as reliability, availability or reputation), or the smaller the value of a negative-effect attribute (such as cost or response time), the better the overall performance of the service. Assuming that the range of qi is [min(qi), max(qi)], positive- and negative-effect attributes are normalized by formulas (2) and (3) respectively:

    q′i = (qi − min(qi)) / (max(qi) − min(qi)),   if max(qi) − min(qi) ≠ 0
    q′i = 1,                                     if max(qi) − min(qi) = 0      (2)

    q′i = (max(qi) − qi) / (max(qi) − min(qi)),   if max(qi) − min(qi) ≠ 0
    q′i = 1,                                     if max(qi) − min(qi) = 0      (3)
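The normalization in formulas (2) and (3) can be sketched as a single helper; the function name and the `positive` flag are ours.

```python
# Sketch of the min-max normalization in formulas (2) and (3).
# All normalized values land in [0, 1], and larger always means better.

def normalize(values, positive=True):
    lo, hi = min(values), max(values)
    if hi - lo == 0:                   # degenerate range: both formulas give 1
        return [1.0 for _ in values]
    if positive:                       # formula (2): larger raw value is better
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]  # formula (3): smaller is better

reliability = normalize([0.90, 0.95, 0.99], positive=True)
cost        = normalize([10.0, 20.0, 30.0], positive=False)
assert reliability[2] == 1.0 and cost[0] == 1.0    # best service scores 1
```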

All QoS values lie in [0, 1] after normalization. The comprehensive performance of the service grows with each QoS value, that is, the larger the QoS value, the better the service performance.

Service Request. A service request sent from the user to the service platform when service discovery is performed can be represented as Req = {Qreq, Mreq}, where Qreq = (α1, α2, ..., αk) is a QoS request vector and α1, α2, ..., αk are the user's expected values for the k attributes qi1, qi2, ..., qik. The QoS values in the request vector are normalized by formula (2) or (3), yielding α′1, α′2, ..., α′k; Qreq is thereby converted to Q′req. The priority vector is Mreq = (m1, m2, ..., mj), j ∈ {1, 2, ..., k}, where j indexes the attributes of Qreq designated as priority attributes of the request Req. Mreq, which may contain one or more priority attributes, is defined by the user requirements and fully reflects the user's preference among the QoS attributes of the target service. The user requirement emphasizes the importance of the j-th attribute qj of the target service, and qj is expected to satisfy the requirement αj in Qreq as far as possible, i.e., q′j ≥ α′j.

Priority. The priority of a service request depends on αj in the QoS request vector Qreq. Let h be the user's expected (normalized) value of the priority QoS attribute, i.e., h = α′j. The priority of the request is calculated by formula (4):

    Prior(h) = 1,  h ∈ [0, T1)
               2,  h ∈ [T1, T2]        (4)
               3,  h ∈ (T2, 1]

T1 and T2 are single-performance thresholds used to determine the priority of the service request; their values lie in [0, 1] with T1 ≤ T2. The priority of the service request Req is divided into three levels, 1, 2 and 3, representing low, medium and high priority respectively. According to the request priority, different matching strategies are selected. The matching strategy set can be represented as MS = {MSH, MSM, MSL}, where MSH, MSM and MSL indicate the matching strategies for high, medium and low priority respectively.
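Formula (4) translates directly into a small helper. The default thresholds below are an assumption borrowed from the T1 = 0.2, T2 = 0.8 setting used later in the experiments.

```python
# Sketch of the priority rule in formula (4): the normalized expected
# value h of the priority attribute maps to priority 1 (low), 2 (medium)
# or 3 (high) via thresholds T1 <= T2 in [0, 1].

def prior(h, t1=0.2, t2=0.8):
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must be normalized to [0, 1]")
    if h < t1:
        return 1      # low priority:    h in [0, T1)
    if h <= t2:
        return 2      # medium priority: h in [T1, T2]
    return 3          # high priority:   h in (T2, 1]

assert prior(0.1) == 1 and prior(0.5) == 2 and prior(0.9) == 3
```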


QoS Performance Evaluation Value. The QoS performance evaluation value is classified into the request performance evaluation value QoSreq and the service performance evaluation value QoSser. QoSreq is derived from the QoS values expected by the user and can be represented as QoSreq = ‖Q′req‖² = α′1² + α′2² + ··· + α′k² = Σ(i=1..k) α′i², where Q′req = (α′1, α′2, ..., α′k) is the QoS request vector after normalization. The QoSser of service Si can be represented as QoSser(i) = ‖Q′i‖² = q′i1² + q′i2² + ··· + q′ik² = Σ(j=1..k) q′ij², where Q′i = (q′i1, q′i2, ..., q′ik) is the QoS attribute vector after normalization.

The Utility of Service Matching. U(i) is the utility of the service matching algorithm when service Si is selected as the target service satisfying the request Req. It is composed of the single-performance utility US(i) and the comprehensive-service utility UC(i). US(i) is the ratio between a certain QoS attribute of Req and that of Si, as in formula (5); UC(i) is the ratio between the overall performance evaluation values of Req and Si, as in formula (6); and U(i) is the weighted sum of US(i) and UC(i), as in formula (7):

    US(i) = h/q′ij,   if h < q′ij
            q′ij/h,   if h ≥ q′ij                                  (5)

    UC(i) = QoSreq/QoSser(i),   if QoSreq < QoSser(i)
            QoSser(i)/QoSreq,   if QoSreq ≥ QoSser(i)              (6)

    U(i) = μ × US(i) + (1 − μ) × UC(i)                             (7)

μ is a weighting factor in the range [0, 1]; the influence of US(i) and UC(i) on U(i) can be adjusted through μ. In the matching process, the greater the utility, the better the service matches the user requirements.
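Formulas (5)-(7) share the same min/max-ratio shape, which the sketch below factors out; the helper names are ours and μ = 0.5 is an arbitrary default.

```python
# Sketch of the matching utility, formulas (5)-(7).

def ratio(a, b):
    """min(a, b) / max(a, b): the common form of formulas (5) and (6)."""
    return a / b if a < b else b / a

def utility(h, q_ij, qos_req, qos_ser, mu=0.5):
    u_s = ratio(h, q_ij)              # formula (5): single-attribute fit
    u_c = ratio(qos_req, qos_ser)     # formula (6): overall-performance fit
    return mu * u_s + (1 - mu) * u_c  # formula (7): weighted sum

# A service whose priority attribute and overall score both match the
# request exactly has utility 1; mismatches shrink it toward 0.
assert utility(0.8, 0.8, 2.0, 2.0) == 1.0
assert utility(0.4, 0.8, 1.0, 2.0) == 0.5
```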

4 Service Matching Algorithm

QoS-based service matching algorithms can be roughly classified into two kinds: single-QoS-performance matching and overall-QoS-performance matching. In the QPSM algorithm, service selection and matching are performed according to the user-defined priority attributes and QoS, so the service most suitable to the user requirements can be matched. QPSM is given as Algorithm 1. The main idea of the algorithm is to select the matching strategy corresponding to the priority of the user request, and then select the service that suits the user best. The priority of the user request is determined by the specified priority attributes, and different matching strategies are adopted according to this priority. When the request priority is determined to be high, the target service must satisfy

QoS-Driven Service Matching Algorithm Based on User Requirements

23

Algorithm 1. QoS-based service matching algorithm (QPSM)
Input: (1) S // service set; (2) Req // user requirements
Output: Ser_match // all services that suit the user
1   Initialize Req, S and its corresponding QoS attribute matrix M;
2   Determine the priority of the request;
3   Compose the priority service set Ser_prior: q′ij ≥ h;
4   Compose the candidate service set Ser_wait: QoSser(i) ≥ QoSreq;
5   while Req is not empty do
6       if Prior(h) = 3 then
7           if Ser_prior = ∅ then
8               Ser_match ← null
9           else
10              Ser_match ← the largest QoSser(i) from Ser_prior
11          end
12      end
13      if Prior(h) = 1 then
14          if Ser_wait = ∅ then
15              Ser_match ← the largest QoSser(i) from S
16          else
17              Ser_match ← the minimum QoSser(i) from Ser_wait
18          end
19      end
20      if Prior(h) = 2 then
21          if Ser_prior ≠ ∅ and Ser_wait = ∅ then
22              Ser_match ← the largest QoSser(i) from Ser_prior
23          end
24          if Ser_prior = ∅ and Ser_wait ≠ ∅ then
25              Ser_match ← the largest q′ij from Ser_wait
26          end
27          if Ser_prior = ∅ and Ser_wait = ∅ then
28              Ser_match ← the largest U(i) from S
29          end
30          if Ser_prior ≠ ∅ and Ser_wait ≠ ∅ then
31              if Ser_inter = Ser_prior ∩ Ser_wait ≠ ∅ then
32                  Ser_match ← the largest U(i) from Ser_inter
33              else
34                  if Ser_union = Ser_prior ∪ Ser_wait ≠ ∅ then
35                      Ser_match ← the largest U(i) from Ser_union
36                  end
37              end
38          end
39      end
40  end
41  return Ser_match;
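The branch structure of Algorithm 1 can be sketched in Python. This is a reduced sketch, not the paper's implementation: it handles a single priority attribute j, collapses the medium-priority corner cases of lines 20-39 into one fallback chain, and takes the utility U(i) as a given function.

```python
# Simplified sketch of the core of Algorithm 1 (QPSM).

def qpsm(services, h, j, qos_req, prior, utility):
    """services: list of (qos_vector, qos_ser) pairs, all values normalized.
    h: expected value of priority attribute j; prior/utility: callables."""
    ser_prior = [s for s in services if s[0][j] >= h]      # line 3
    ser_wait  = [s for s in services if s[1] >= qos_req]   # line 4
    p = prior(h)
    if p == 3:                                             # high priority
        return max(ser_prior, key=lambda s: s[1]) if ser_prior else None
    if p == 1:                                             # low priority
        if not ser_wait:
            return max(services, key=lambda s: s[1])
        return min(ser_wait, key=lambda s: s[1])           # balances load
    # medium priority: intersection first, then union, then all of S
    pool = ([s for s in ser_prior if s in ser_wait]
            or ser_prior + ser_wait or services)
    return max(pool, key=utility)

services = [((0.9, 0.2), 0.85), ((0.5, 0.7), 1.2), ((0.3, 0.9), 1.5)]
best = qpsm(services, h=0.8, j=0, qos_req=0.5,
            prior=lambda h: 3, utility=lambda s: s[1])
assert best == ((0.9, 0.2), 0.85)   # the only service with q'_0 >= 0.8
```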

the priority attributes of the user requirements completely. When the request priority is judged to be low, the service with the smallest service performance evaluation value that still satisfies the user request performance evaluation value is selected; thus the load of the entire service system is balanced and an optimized matching of resources is achieved. When the request priority is judged to be medium, the user request and the service performance are weighed against each other, and the service selection is determined by the utility of service matching.

Ser_match, the matching service set, is composed of the services selected by the priority attributes. When there is more than one priority attribute, a conflict between matching policy selections may occur. Merging the matching services means merging the services in Ser_match so that finally the single most suitable service is selected for the user. Algorithm 2 shows the whole procedure of matching-service merging.

Algorithm 2. Merge matching service
Input: Ser_match // matching service set
Output: Ser_result // the most suitable service for the user
1   Initialize α′ ∈ {α′1, ..., α′k}, i ∈ {1, ..., n}, j ∈ {1, ..., k};
2   for Ser_match ≠ ∅ do
3       if num(Prior(α′) = 3) ≥ 1 then
4           if num(Ser_match(q′ij ≥ α′j)) ≥ 2 then
5               Ser_result ← the largest U(i) from Ser_match(q′ij ≥ α′j)
6           end
7           if num(Ser_match(q′ij ≥ α′j)) = 1 then
8               Ser_result ← Ser_match(q′ij ≥ α′j)
9           end
10          if num(Ser_match(q′ij ≥ α′j)) = 0 then
11              Ser_result ← null
12          end
13      end
14      if num(Prior(α′) = 3) = 0 then
15          if num(Ser_match) ≥ 2 then
16              Ser_result ← the largest U(i) from Ser_match
17          else
18              Ser_result ← Ser_match
19          end
20      end
21  end
22  return Ser_result;

5 Experiment Analysis

The main purpose of the QPSM algorithm is to select the most suitable service for the user according to user-deﬁned QoS request. In order to verify the feasibility and eﬀectiveness of this algorithm, it is compared with the other two QoS-based matching algorithms, Single-QoS and Overall-QoS, in four aspects that is response rate, load, average single performance value and overall performance value. All the experiments were conducted on a computer with a 3.2

QoS-Driven Service Matching Algorithm Based on User Requirements

25

GHz Intel Core 2 Duo CPU and 12 GB RAM. The data used for the experiment derived from two sources: a data set containing 1000 actual services and 5 QoS values, and a randomly generated user request data set. The purpose of the ﬁrst experiment is to evaluate the response rate of the algorithm, that is the ratio of successfully matched and returned requests to the total requests. In this experiment, 100 services are selected for matching and 1000 service requests are randomly generated. The response rates of this three algorithms are shown in Fig. 1. As the number of user requests increase, the response rate of each algorithm tends to be stable. The QPSM algorithm outperforms other algorithms with the highest response rate at about 96%. However, the response rate of the Single-QoS algorithm [8] is the lowest at about 88%. The reason for this result is that the Single-QoS algorithm will fail to respond when all candidate services do not satisfy the QoS constraints. The Overall-QoS algorithm [10] will fail to respond when the overall performance is lower than user request performance. In QPSM algorithm, the matching results will be found through a comprehensive consideration of user requirement and service performance. 1

Fig. 1. The response rate of each algorithm (Overall-QoS, QPSM, Single-QoS) versus the number of user requests

M. Guo and X. Yang

The second experiment evaluates the effect of load balancing, which is indicated by the number of times that services with different QoS performance respond to requests. In this experiment, 5 candidate services with the same function but different QoS are selected and 1000 service requests are randomly generated. The distributions of service load under the traditional UDDI algorithm [5] and the QPSM algorithm are compared, and the load distributions of the QPSM algorithm with different single-performance thresholds T1 and T2 are tested. Figure 2 shows that the QPSM algorithm outperforms the UDDI algorithm in terms of load balancing for the same number of service requests. The greater the difference between T1 and T2, the better the load balancing, because a greater difference between T1 and T2 causes more service requests to be judged as medium priority.

Fig. 2. Distribution of service matching load rate over the candidate services S1–S5 (UDDI vs. QPSM with T1 = 0.5, T2 = 0.8 and with T1 = 0.2, T2 = 0.8)

The third experiment evaluates the average service single-performance value and the overall-performance value. In this experiment, 1000 services used for matching are selected and 1000 user requests with a high demand for response time and reliability are randomly generated. The μ in the service matching utility U(i) is taken as μ = 0.2 and μ = 0.8, respectively. Figure 3 shows that the larger μ is, the higher the average reliability of the matched service, the shorter the response time, and the lower the overall service performance value. This is because the value of μ determines the proportion of the single-performance utility value US(i) and the comprehensive service utility value UC(i) in the service matching utility U(i), and thus affects the final service selection. Users can select an appropriate μ according to their requirements.

Fig. 3. Service single-performance ((a) average reliability, (b) average response time) and overall performance ((c) overall service performance) with the number of user requests
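The role of μ can be illustrated numerically. The following sketch assumes the common linear combination U(i) = μ·US(i) + (1 − μ)·UC(i) (the exact definition appears earlier in the paper; the service values here are invented for illustration):

```python
def matching_utility(us, uc, mu):
    """Combine single-performance utility US(i) and comprehensive utility UC(i).

    Assumes the linear form U(i) = mu * US(i) + (1 - mu) * UC(i); a larger mu
    shifts weight toward the single-performance attributes (e.g. reliability,
    response time) that the user cares most about.
    """
    return mu * us + (1 - mu) * uc

# Service A: strong single performance, weaker overall performance.
# Service B: the opposite.
ua_small, ub_small = matching_utility(0.9, 0.5, 0.2), matching_utility(0.6, 0.8, 0.2)
ua_large, ub_large = matching_utility(0.9, 0.5, 0.8), matching_utility(0.6, 0.8, 0.8)
# With mu = 0.2 service B is preferred; with mu = 0.8 service A is preferred.
```

This reproduces the trade-off observed in Fig. 3: a larger μ favors services with better single-performance values at the cost of overall performance.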

6 Conclusion

Due to the uncertainty caused by the dynamic change of service QoS and the ambiguity of user requirements, current service matching algorithms have some limitations. In order to describe QoS attributes more accurately, we propose a time-segmented QoS model that takes time into consideration. Based on this model, a service matching algorithm based on user QoS requests and priorities is also proposed. In this algorithm, user requirements and QoS performance preferences are fully considered, and the most suitable service is selected according to user-defined service requests and priorities, which makes it more suitable for users with specific requirements. Finally, experimental results indicate that the proposed algorithm achieves a higher response rate and a better load-balancing effect.

References

1. Benslimane, D., Dustdar, S., Sheth, A.: Services mashups: the new generation of web applications. IEEE Internet Comput. 12(5), 13–15 (2008)
2. He, Q., Yan, J., Jin, H., Yang, Y.: Quality-aware service selection for service-based systems based on iterative multi-attribute combinatorial auction. IEEE Trans. Softw. Eng. 40, 192–215 (2014)
3. Zhao, S., Wu, G., Zhang, S.: Review of QoS research in SOA. Comput. Sci. 36(4), 16–20 (2009)
4. Klein, A., Ishikawa, F., Honiden, S.: SanGA: a self-adaptive network-aware approach to service composition. IEEE Trans. Serv. Comput. 7(3), 452–464 (2014)
5. Guo, D., Ren, Y., Chen, H.: A QoS constrained web service selection and ordering model. J. Shanghai Jiaotong Univ. 41(6), 870–875 (2007)
6. Zhao, S., Zhang, Y., Yu, L., Cheng, B., Ji, Y., Chen, J.: A multidimensional resource model for dynamic resource matching in internet of things. Concurr. Comput. Pract. Exp. 27(8), 1819–1843 (2015)
7. Li, L., Liu, N., Li, G.: A QoS-based dynamic service composition method in semantic internet of things. Appl. Res. Comput. 33(3), 802–805 (2016)
8. Zeng, L., Benatallah, B., Ngu, A.H.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-aware middleware for web services composition. IEEE Trans. Softw. Eng. 30(5), 311–327 (2004)
9. Cardoso, J., Sheth, A., Miller, J., Arnold, J., Kochut, K.: Quality of service for workflows and web service processes. Web Semant. Sci. Serv. Agents World Wide Web 1(3), 281–308 (2004)
10. Jia, B., Li, W., Zhou, T.: A centralized service discovery algorithm via multi-stage semantic service matching in internet of things. In: 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), pp. 422–427 (2017). https://doi.org/10.1109/CSE-EUC.2017.82
11. Chen, L., Yang, J., Zhang, L.: Time based QoS modeling and prediction for web services. In: Kappel, G., Maamar, Z., Motahari-Nezhad, H.R. (eds.) ICSOC 2011. LNCS, vol. 7084, pp. 532–540. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25535-9_38

Research on Overload Classification Method for Bus Images Based on Image Processing and SVM

Tingting Li1, Yongxiong Sun2(✉), Yanhua Liang1, Yujia Zhai2, and Xuan Ji2

1 College of Software, Jilin University, Changchun 130012, China
2 College of Computer Science and Technology, Jilin University, Changchun 130012, China
[email protected]

Abstract. The speed and efficiency of manually screening bus images for overloading are relatively low, which wastes a large amount of human resources. Therefore, an overload classification method for bus images based on image processing and support vector machine is proposed to intelligently identify whether an image shows overloading. On this basis, we have done the following work. Firstly, the bus images were preprocessed, including image enhancement using the histogram equalization method and image segmentation using an improved Otsu algorithm. Secondly, the features of the segmented images were extracted by the Kirsch edge detection operator to establish an image feature sample library. Finally, an appropriate kernel function and parameters were chosen to establish a classifier model based on support vector machine, which is trained on the sample library to classify the bus images. Theoretical analysis and experimental results show that, within the finite range of parameters selected, the average classification accuracy of the polynomial kernel function is better than those of the Gaussian kernel function and the Sigmoid kernel function. When the parameter d of the polynomial kernel function is 4, the classification accuracy is 93.68%, and the classification performance is stable, without significant rises or falls. This conclusion was verified in actual application.

Keywords: Bus overload · Image segmentation · Image feature extraction · Support vector machine · Image classification

1 Introduction

Bus overload refers to the number of passengers in a vehicle exceeding the authorized number of passengers. Bus overload is a direct threat to the safety of the passengers: once a traffic accident occurs, it will lead to casualties and have a significant influence on society [1]. In order to prevent vehicles from overloading as much as possible, the public security, transportation, highway and other departments take active measures. On the one hand, they actively publicize the danger of overloading to enhance the safety awareness of passengers. On the other hand, they use different kinds of advanced technology to supervise overloading, such as installing driving recorders, cameras, and other monitoring equipment in buses [2].

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 28–43, 2018. https://doi.org/10.1007/978-3-030-05057-3_3

These measures not only reduce the waste of manpower and material resources, but also support evidence-based investigation and punishment of overloaded illegal vehicles. At present, most provinces and cities in China still use a manual recognition method to classify the images photographed by cameras installed in buses to determine whether a bus is overloaded. Although the accuracy of the manual identification method is high, its efficiency is low, so it cannot meet current regulatory needs [3]. To solve this problem, an overload classification method for bus images based on image processing and support vector machine (SVM) is proposed. Compared with existing manual recognition methods, this method can automatically recognize overloaded bus images, which saves a great deal of human resources and greatly improves the speed and volume of illegal-image recognition [4].

2 Pretreatment of Bus Images

The purpose of this paper is to classify bus images to detect overloaded buses by using image processing and support vector machine. Image preprocessing is the precondition of image classification and plays an important role in classifying overloaded bus images. The experimental data are derived from the historical data of the transportation department of Jilin City, Jilin Province.

Fig. 1. First group of bus image enhancement eﬀect graph and contrast histogram


T. Li et al.

2.1 Histogram Equalization

In this paper, the images were taken by cameras installed in buses, and their quality is poor. Thus, image enhancement is necessary before image segmentation. We used histogram equalization to enhance the original images, which makes the gray-level distribution of the whole image tend to be uniform. The process is as follows:

Fig. 2. Second group of bus image enhancement eﬀect graph

1. Count the number of pixels n_i for each gray level of the original bus image, i = 0, 1, …, L − 1, where L is the total number of gray levels, and calculate the original image histogram:

   P(r_i) = n_i / n    (1)

   where n is the total number of pixels of the original image.

2. Calculate the cumulative distribution function:

   s_k(r_k) ≈ Σ_{i=0}^{k} P(r_i),  k = 0, 1, …, L − 1    (2)

3. Calculate the output gray level:

   g_k = INT[(g_max − g_min) s_k(r_k) + g_min + 0.5] / (L − 1)    (3)

   where k = 0, 1, …, L − 1 and INT[·] is the rounding operator. When g_min = 0 and g_max = L − 1, formula (3) can be written as:

   g_k = INT[(L − 1) s_k(r_k) + 0.5] / (L − 1)    (4)

4. Obtain the output image by modifying the original image according to the mapping between the original gray levels r_k and the output gray levels g_k.

5. Apply image enhancement to the two groups of original bus images; the corresponding results are shown in Figs. 1 and 2.

2.2 Image Segmentation

In order to classify the overloaded bus images using a support vector machine, we need to separate the target area from the background area to obtain training data. Thus, it is important to segment the target area from the original image. Threshold segmentation is one of the earliest image segmentation methods, and it is simple and effective. Representative methods include the maximum between-class variance (Otsu) method, the minimum cross-entropy threshold method, the maximum entropy threshold method, and the maximum correlation threshold method [5].

Fig. 3. Comparison of image segmentation eﬀects of four segmentation methods

Through analysis of the bus images, we regard the aisle in the image as the target area and the surrounding passengers as the background area. We then process the same image using the four traditional segmentation methods mentioned above. The corresponding results are shown in Fig. 3.

As shown in Fig. 3, the four segmentation methods all produce noise and holes, which greatly affects feature extraction. Therefore, in this paper, we first process the bus images using threshold segmentation, and then repeatedly apply the closed operation to remove noise and fill holes. The closed operation is a more advanced morphological transformation that combines the dilation and erosion algorithms [6]: the dilation operation is first applied to the segmented image, and the erosion operation is then applied to the result. We processed the images in Fig. 3 with three closed operations, and the results are shown in Fig. 4.

Fig. 4. Eﬀect graph using threshold segmentation and closed operation

As shown in Fig. 4, the traditional maximum correlation threshold segmentation method gives the worst results, while the traditional Otsu method gives the best: it can effectively separate the target area from the background area, apart from several connecting pixels.
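For concreteness, the preprocessing chain described in this section (histogram equalization, Otsu thresholding, repeated closing) can be sketched in plain NumPy. This is our re-implementation for illustration, not the authors' code; a production pipeline would typically use OpenCV equivalents:

```python
import numpy as np

def equalize(img, L=256):
    # Histogram equalization: map gray levels through the normalized CDF,
    # i.e. formula (4) rescaled back to the range [0, L-1]
    hist = np.bincount(img.ravel(), minlength=L)
    cdf = np.cumsum(hist) / img.size
    return np.floor((L - 1) * cdf[img] + 0.5).astype(np.uint8)

def otsu_threshold(img, L=256):
    # Classic Otsu: maximize between-class variance w0*w1*(mu1 - mu0)^2
    hist = np.bincount(img.ravel(), minlength=L).astype(float)
    p = hist / hist.sum()
    best_t, best_sigma = 0, -1.0
    for t in range(1, L):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, L) * p[t:]).sum() / w1
        sigma = w0 * w1 * (mu1 - mu0) ** 2
        if sigma > best_sigma:
            best_t, best_sigma = t, sigma
    return best_t

def dilate(mask):
    # 3x3 dilation of a boolean mask via shifted ORs
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= padded[1 + dy : 1 + dy + mask.shape[0],
                          1 + dx : 1 + dx + mask.shape[1]]
    return out

def erode(mask):
    # Erosion is dilation of the complement
    return ~dilate(~mask)

def close(mask, times=3):
    # Closing = dilation followed by erosion; repeated to fill holes
    for _ in range(times):
        mask = erode(dilate(mask))
    return mask
```

Applying `close` three times to the thresholded mask mirrors the "three times of closed operations" used to produce Fig. 4.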

Fig. 5. Gray histogram of graph a in Fig. 3

In this paper, we select the middle aisle of the bus image as the target of the training samples. Figure 3(a) is a normal bus image in which the aisle region accounts for one fifth of the original image and the non-aisle region accounts for four fifths. Thus, compared with the background area, the target area is much smaller. The gray histogram of image (a) is shown in Fig. 5.

As shown in Fig. 5, there are fewer pixels on the left, the gray distribution of the middle pixels is uniform, and the gray level at the far right almost reaches a peak. This means that the pixels of the target area concentrate on the left, while the pixels of the background concentrate in the middle and on the right; the gray scale of the background area is larger than that of the target area. Owing to the small variance in the target area and the large variance in the background area, the traditional Otsu method biases the threshold toward the large-variance area, which makes the calculated threshold larger than the ideal threshold and yields poor segmentation results. In order to improve the quality of image segmentation and the accuracy of identifying overloaded bus images, we modify the traditional Otsu method. The original criterion of the Otsu algorithm can be written in the form:

σ(t) = ω0(μ0 − μ)² + ω1(μ1 − μ)² = ω0 ω1(μ1 − μ0)²    (5)

where ω0 is the probability of the target class and ω1 is the probability of the background class; the target area and the background area are each weighted by their class probabilities [7]. In this paper, we adjust the weighting by lowering and raising the powers of ω0 and ω1. The improved Otsu criterion is:

σ(t) = ω0^α (μ0 − μ)² + ω1^β (μ1 − μ)² = ω0^α ω1^β (μ1 − μ0)²    (6)

where α represents the proportion of the background area in the whole image, and β is the reciprocal of α, so that the algorithm has no bias toward either class. By modifying the original formula, we ensure that the threshold will not be too high when the variance of one class is larger than that of the other, and at the same time the gray level between the two classes is more balanced. The results of the traditional Otsu algorithm and the improved Otsu algorithm are shown in Fig. 6.

Fig. 6. Comparison between traditional Otsu algorithm and improved Otsu algorithm

As shown in Fig. 6, with the improved Otsu algorithm the passengers in the background area are no longer classified into the target area; it can effectively separate the target area from the background area. Therefore, in this paper, we use the improved Otsu algorithm and the closed operation to segment the bus images, which resolves the effects of noise and holes and provides a good basis for feature extraction.
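A sketch of threshold selection with the weighted criterion (6) follows. This is our illustration only: α would be estimated from the image beforehand as the background proportion, and β = 1/α as stated above.

```python
import numpy as np

def improved_otsu(img, alpha, L=256):
    """Threshold selection with the weighted criterion of Eq. (6).

    alpha: proportion of the background area in the whole image (estimated
    beforehand); beta = 1/alpha, so the criterion no longer biases the
    threshold toward the class with the larger variance.
    """
    beta = 1.0 / alpha
    hist = np.bincount(img.ravel(), minlength=L).astype(float)
    p = hist / hist.sum()
    levels = np.arange(L)
    mu = (levels * p).sum()  # global mean gray level
    best_t, best_sigma = 0, -1.0
    for t in range(1, L):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        # Eq. (6): class probabilities raised to alpha and beta respectively
        sigma = w0**alpha * (mu0 - mu)**2 + w1**beta * (mu1 - mu)**2
        if sigma > best_sigma:
            best_t, best_sigma = t, sigma
    return best_t
```

With α < 1 the target-class term is boosted and the background-class term damped, pulling the selected threshold down toward the ideal value described in the text.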

3 Bus Image Feature Extraction

After image enhancement and segmentation, we select the Kirsch operator to extract features from the segmented images and build an image feature database, which is used to classify the bus images with a support vector machine. The Kirsch operator computes the convolution and derivative of each pixel using eight templates. The eight templates represent eight directions and give the maximal response to the eight specific edge directions of the image; the output of the Kirsch operator is the maximum over the eight directions. Kirsch is an effective edge detection operator that can significantly suppress noise during edge detection [8]. Assume the original image is as shown in Fig. 7:

a3   a2   a1
a4  (i,j)  a0
a5   a6   a7

Fig. 7. A 3 × 3 sub-picture of the original image

The gradient of the edge is:

G(i, j) = max{1, max_{k=1,…,8} |5S_k − 3T_k|}    (7)

where S_k = x_{k+1} + x_{k+2} + x_{k+3} and T_k = x_{k+4} + x_{k+5} + … + x_{k+8} (subscripts taken modulo 8), and k = 1 to 8 indexes the eight direction templates shown in Fig. 8. The Kirsch operator is based on the fact that the gray scale of non-edge pixels is smaller than the threshold, while the gray scale of edge pixels is larger than the threshold. When detecting image edges, we first binarize the original image with a lower threshold, then detect the target area and the background area. The target area and the background area can be effectively divided by the boundary regions whose gray scale is larger than the threshold [9]. Using the method described above, we preprocess the two groups of original bus images and extract the corresponding features. The results are shown in Fig. 9.
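The detection in (7) can be sketched directly. Below is a slow reference implementation in NumPy (our illustration), with the eight neighbors circularly ordered a0–a7 as in Fig. 7; a vectorized or OpenCV-based version would be used in practice:

```python
import numpy as np

def kirsch_edge(img):
    """Edge gradient G(i,j) = max(1, max_k |5*S_k - 3*T_k|), cf. Eq. (7).

    S_k is the sum of three consecutive neighbors around the pixel,
    T_k the sum of the remaining five. Border pixels are left at zero
    in this sketch.
    """
    img = img.astype(np.int64)
    h, w = img.shape
    # Eight neighbors in circular order a0..a7 (E, NE, N, NW, W, SW, S, SE)
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]
    out = np.zeros_like(img)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            a = [img[i + di, j + dj] for di, dj in offsets]
            best = 0
            for k in range(8):
                s = a[k] + a[(k + 1) % 8] + a[(k + 2) % 8]
                t = sum(a) - s  # the remaining five neighbors
                best = max(best, abs(5 * s - 3 * t))
            out[i, j] = max(1, best)
    return out
```

On a constant region the response is |5·3c − 3·5c| = 0, so the operator only fires where the 3-neighbor sum differs from the 5-neighbor sum, i.e. across an edge.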


Fig. 8. Eight directions template

Fig. 9. Two sets of bus original images and features extraction

Fig. 10. Unloaded and overloaded image extraction aisle shape eﬀect

For the classification of bus images, we are only concerned with the target area information. In order to reduce computation and improve the accuracy of classification, we need to avoid the influence of the non-target area after extracting the image outlines. In this paper, we simply process the extracted outline images and then extract the shape of the aisle position as sample data for the image feature database. Figure 10 shows the extracted aisle shapes. In this paper, we process 551 bus images; some results are shown in Fig. 11.

Fig. 11. Part of the bus image feature samples

4 Image Classification Based on Support Vector Machine

By analyzing the image features of the target area, we find that the features when there are passengers in the aisle differ significantly from the features when there are no passengers in the aisle. Therefore, we can recognize overloaded bus images using the shapes of the feature images of the target area. We divide the training data into two parts, a positive training set and a negative training set, where the positive training set stores outline feature samples from non-overloaded bus images and the negative training set stores outline feature samples from overloaded bus images. After constructing the two training sets, we can use a support vector machine to classify the bus images. The support vector machine is very effective for linear classification problems [10]. A nonlinear classification problem can be transformed into a linear one by a nonlinear transformation function, which makes it linearly separable in a high-dimensional space [11]. For a nonlinear classification problem, solving for the optimal classification surface is equivalent to the following problem:

Minimize    φ(w, ξ) = (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i    (8)

Subject to  y_i[(wᵀ x_i) + b] − 1 + ξ_i ≥ 0,  ξ_i ≥ 0,  i = 1, 2, …, n    (9)

where C > 0 is a penalty coefficient. This is a quadratic programming problem that can be solved by the Lagrange method and translated into the following dual problem:

Maximize    Q(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (x_i · x_j)    (10)

Subject to  Σ_{i=1}^{n} y_i α_i = 0    (11)

            0 ≤ α_i ≤ C,  i = 1, 2, …, n    (12)

The weight coefficient of the optimal classification surface is:

w = Σ_{i=1}^{n} α_i y_i x_i    (13)

It can be seen that the weight coefficient of the optimal classification surface is a linear combination of the training samples. From formula (12), the smaller the penalty coefficient C is, the smaller the Lagrange multipliers are; likewise, from formula (13), the smaller the α_i are, the smaller ‖w‖ is, which means a larger interval between the two classes and thus better generalization performance of the SVM. In short, the smaller C is, the larger the interval between the two classes and the better the generalization performance, at the cost of reduced accuracy; conversely, the larger C is, the smaller the interval and the poorer the generalization performance, but the higher the accuracy. Therefore, the penalty coefficient affects both the generalization performance and the accuracy of the SVM, and the value of C should be selected appropriately. In this paper, we classify bus images based on SVM by choosing an appropriate kernel function. The type of kernel function significantly affects the performance of the SVM. Three common kernel functions are used in this paper: the polynomial kernel function, the Gaussian kernel function and the Sigmoid kernel function [12]. They can be written in the following forms. Polynomial kernel function:

K(x_i, x_j) = [(x_i · x_j) + 1]^d    (14)

where d is the degree of the polynomial. Gaussian kernel function:

K(x_i, x_j) = exp(−σ ‖x_i − x_j‖²)    (15)

Sigmoid kernel function:

K(x_i, x_j) = tanh(σ(x_i · x_j) + c)    (16)

Among them, the polynomial kernel function depends on d, and the Gaussian kernel function depends on σ. In this paper, we compare the accuracy of the SVM model with different kernel functions on a large number of test images, and finally choose the optimal classifier.

In this paper, we classify overloaded bus images based on image processing and support vector machine. Firstly, we select training samples from the standard sample library and preprocess the selected images, including histogram equalization, image segmentation and the closed operation. Secondly, we extract the edge features of the preprocessed images and build a feature-sample training set. Then, we select an appropriate kernel function and parameters and train a support vector machine model on the training set. Finally, we use the trained model to predict the class labels of the testing set and calculate the accuracy of the model. The whole flow chart is shown in Fig. 12.
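As an illustrative sketch of this workflow, the following uses scikit-learn's `SVC` as a stand-in SVM and synthetic feature vectors in place of the extracted aisle-shape features (all data and names here are ours, not the paper's):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for the extracted aisle-shape feature vectors: non-overloaded
# samples cluster around one prototype, overloaded samples around another.
X_pos = rng.normal(loc=0.0, scale=0.3, size=(60, 16))   # non-overloaded
X_neg = rng.normal(loc=2.0, scale=0.3, size=(60, 16))   # overloaded
X = np.vstack([X_pos, X_neg])
y = np.array([0] * 60 + [1] * 60)

# "Set aside" split: mutually exclusive training and testing sets
idx = rng.permutation(len(y))
train, test = idx[:80], idx[80:]

# Polynomial kernel with d = 4 and penalty C = 100, as chosen in the paper;
# coef0=1 matches the K = [(x·x') + 1]^d form of Eq. (14)
clf = SVC(kernel="poly", degree=4, coef0=1.0, C=100.0)
clf.fit(X[train], y[train])
accuracy = clf.score(X[test], y[test])
```

On these well-separated synthetic clusters the classifier is near-perfect; the point of the sketch is the train-then-score structure of Fig. 12, not the number itself.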

Fig. 12. Image overload and classiﬁcation based on image processing and SVM

5 Experiments and Results

The purpose of this paper is to divide bus images into non-overloaded and overloaded images based on image processing and support vector machine. It is difficult to determine which type of kernel function is best when the feature mapping is unknown, so the performance of the model depends significantly on the choice of kernel function. At present, many researchers make this choice based on the generalization error of the classifier over a great many experiments [13]. In this paper, 897 whole bus images are used as the sample database, including 36 obstructed images and 861 normal images. In order to analyze the experimental results, the 861 normal images are selected as the standard dataset. The dataset consists of two types of images, of which 669 are non-overloaded and 192 are overloaded; the resolution of each image is 352 × 288. We divide the dataset into training and testing sets by using the "set aside method" [14]. The so-called "set aside method" is a popular sampling method in which dataset D is divided into two mutually exclusive sets: one is the training set S, the other is the testing set T. After training a model on the training set S, the testing set T is used to calculate the testing error, which estimates the generalization error of the model. In this paper, 426 non-overloaded images and 125 overloaded images are selected randomly from the standard 861-image dataset as the training set, and the remaining 310 images (243 non-overloaded and 67 overloaded) form the testing set; this ensures that the testing samples are not used in the training process. We select accuracy as the evaluation indicator, i.e., the proportion of correctly classified samples in the testing set. Each experiment is repeated over 5 random divisions, the evaluation result is the mean of the five runs, and results are reported to two decimal places.

The purpose of this experiment is to observe the classification accuracy of the classifier under different parameters of different kernel functions, and to select the kernel function and parameters most suitable for this project. For the polynomial kernel function, d takes the values 1, 2, 3, 4 and 5. For the Gaussian kernel function, σ takes the values 0.1, 0.5, 1, 2 and 5. For the Sigmoid kernel function, with σ = 1, c takes the values 0.5, 0.3, 0.1, 0, −0.1, −0.3 and −0.5. Meanwhile, according to the literature [15], the penalty factor C is set to 100. The classification accuracies of the three kernel functions with different parameters are given below (Table 1).

Table 1. Classification accuracy of polynomial kernel function with different parameters

Group     d=1     d=2     d=3     d=4     d=5
1         80.65   84.84   89.68   93.23   89.68
2         80.00   85.48   91.61   93.55   89.68
3         81.61   87.10   90.32   94.19   90.00
4         80.97   85.16   90.97   93.87   89.68
5         82.26   86.45   88.71   93.55   89.03
Mean (%)  81.10   85.81   90.26   93.68   89.61

Figure 13 shows the average classiﬁcation accuracy of polynomial kernel function with diﬀerent parameters.

Fig. 13. Mean accuracy curve of polynomial kernel function with diﬀerent parameters

As can be seen from the trend of the curve in Fig. 13, the classification accuracy of the polynomial kernel function varies with its parameter. As d increases, the classification accuracy of the model first increases and then decreases. When d is 4, the classification effect is the best, reaching 93.68%. Over the experimentally selected values of d, the average classification accuracy fluctuates within the limited range of 81.10%–93.68%, and the performance of the model is relatively stable (Table 2).

Table 2. Classification accuracy of RBF (Gaussian) kernel function with different parameters

Group     σ=0.1   σ=0.5   σ=1     σ=2     σ=5
1         67.74   90.32   85.16   74.19   70.97
2         68.39   89.68   83.87   75.81   70.00
3         70.32   88.71   84.52   74.84   69.35
4         69.35   90.32   84.52   75.81   71.61
5         67.42   90.97   83.23   75.16   70.00
Mean (%)  68.64   90.00   84.26   75.16   70.39

The average classification accuracy curve of the Gaussian kernel function with different parameters is shown in Fig. 14.

Fig. 14. Mean accuracy curve of RBF kernel function under diﬀerent parameters

Fig. 15. Mean accuracy curve of Sigmoid kernel function under diﬀerent parameters

It can be seen from Fig. 14 that the Gaussian kernel function yields different classification accuracies with different parameters. Within the limited range of parameters selected, when σ is 0.1 the classification effect is poor; when σ is 0.5 the classification effect is the best; and as σ increases further, the classification accuracy drops and is not very stable. The average accuracy curve of the Sigmoid kernel under different parameters is shown in Fig. 15. From the experimental data of Table 3 and the average accuracy curve of Fig. 15, the classification accuracy of the Sigmoid kernel function fluctuates in the range of 72.32%–88.77%. When c is −0.3, the classification accuracy is the best. Moreover, when c takes a negative value, the classification accuracy is better than with a positive value, which accords with the analysis of the Sigmoid kernel in ref. [16].

Table 3. Classification accuracy of Sigmoid kernel function under different parameters

c      Group 1   Group 2   Group 3   Group 4   Group 5   Mean (%)
−0.5   80.65     82.26     81.61     83.87     80.65     81.81
−0.3   88.06     87.10     89.68     90.32     88.71     88.77
−0.1   84.52     85.16     85.48     85.16     83.87     84.84
0      78.06     79.03     78.71     80.00     78.71     78.90
0.1    76.13     75.81     75.81     77.42     76.77     76.39
0.3    80.65     80.65     81.29     80.00     80.97     80.71
0.5    71.94     70.97     72.58     72.58     73.55     72.32

Comprehensively analyzing the three kernel functions selected in this paper, the classification performance of the polynomial kernel function is obviously better than that of the other two. For the Gaussian kernel function, only when σ is 0.5 does the classification accuracy reach 90.00%; for other values of σ, the classification effect is unstable. The classification performance of the Sigmoid kernel function is also unstable and exhibits oscillation. The polynomial kernel function has the highest average classification accuracy among the three, up to 93.68%, and its classification performance is relatively stable. In general, the polynomial kernel function with parameter d = 4 is the best choice for bus image overload classification in this paper, though it should be noted that this choice is best only within the limited range of kernel functions and parameters examined. It can be seen from the above experiments that the average success rate of bus overload classification using the image classification method based on support vector machines reaches 93.68%, and when the method is applied to the traffic visualization system in Jilin Province, the accuracy rate still reaches about 93%. Thus, image processing and support vector machine technology can achieve bus overload detection.

6 Conclusion

In this paper, images of passengers photographed inside the bus are processed by image enhancement, improved threshold segmentation and the closed operation; feature extraction is then performed on these preprocessed image samples to establish a training set; and an appropriate kernel function and parameters are selected to establish the SVM model, which is trained on the training set. The imported images are finally classified automatically, and the overloaded images are intelligently identified. For the images used in this paper, comparative analysis over multiple sets of experiments shows that the classification accuracy is highest when the polynomial kernel parameter d is 4. Increasing the recognition speed and efficiency for overloaded bus images can save a great deal of human resources and increase the penalty rate for violations, so the bus image overload classification method based on image processing and support vector machines has great value. However, there is still a gap to the ideal classification accuracy of 100%; further improving the classification accuracy is future work.

References 1. Ding, C.: The eﬀect of overloaded cars and the tire pressure on the stress distribution of the road. Int. J. Intell. Inf. Manag. Sci. 5(3), 264–267 (2016) 2. Wang, W.L., Lu, C.Z., Li, Y.R.: Basic economic measures in long-term eﬀective mechanism for administering overload and oversize of motor vehicles. Int. J. Intell. Inf. Manag. Sci. 24(6), 148–152 (2007) 3. Zhang, Z., Cheng, W., Wu, L., et al.: Study on circular traﬃc signs recognition method based on invariant moments and SVM. J. Electron. Meas. Instrum. 31(5), 773–779 (2017) 4. Zhao, G.Q., Wang, F.J.: Car train overload signal monitoring system optimization modeling research. Comput. Simul. 33(11), 162–163 (2016) 5. Wu, Y.Q., Meng, T.L., Wu, S.H.: Research progress of image thresholding methods in recent 20 years (1994–2014). J. Data Acquis. Process. 30(1), 1–23 (2015) 6. Yan, J.Z., Lin, S., Sing, B.K.: Change-based image cropping with exclusion and compositional features. Int. J. Comput. Vis. 114(1), 74–87 (2015) 7. A R Correspondng’s scientiﬁc contributions, Venmathi, Venmathi, A.R., et al.: Kirsch compass kernel edge detection algorithm for micro calciﬁcation clusters in mammogram. Middle East J. Sci. Res. 24(4), 1530–1535 (2016) 8. Liu, D.H., Zhang, Y.D., Li, X., et al.: Adaptive thresholding method under the dynamic environment. J. Comput. Appl. 36(S2), 152–156 (2016) 9. A R Correspondng’s scientiﬁc contributions, Venmathi, A.R., Venmathi, E.N., Ganesh, N.K.: Kirsch Compass kernel edge detection algorithm for micro calciﬁcation clusters in mammograms. Middle East J. Sci. Res. 24(4), 1530–1535 (2016) 10. Thang, P.Q., Thuy, N.T., Lam, H.T.: A modification of solution optimization in support vector machine simplification for classification. In: Bhateja, V., Nguyen, B.L., Nguyen, N.G., Satapathy, S.C., Le, D.-N. (eds.) Information Systems Design and Intelligent Applications. AISC, vol. 672, pp. 149–158. Springer, Singapore (2018). https://doi.org/ 10.1007/978-981-10-7512-4_15


11. Zhi, J., Sun, J., Wang, Z., Ding, W.: Support vector machine classifier for prediction of the metastasis of colorectal cancer. Int. J. Mol. Med. 41(3), 1419–1426 (2018)
12. Mcdonald, G., Macdonald, C., Ounis, I.: A study of SVM kernel functions for sensitivity classification ensembles with POS sequences. In: SIGIR 2017, pp. 1097–1100 (2017)
13. Yang, L., Wang, Y.: Survey for various cross-validation estimators of generalization error. Appl. Res. Comput. 32(5), 1287–1290 (2011)
14. Zhou, Z.H.: Machine Learning, 2nd edn. Tsinghua University Press, Beijing (2016)
15. Yu, Z., Wong, H.S., Wen, G.: A modified support vector machine and its application to image segmentation. Image Vis. 29(1), 29–40 (2016)
16. Hsuan, T.L., Chih, J.L.: A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Submitt. Neural Comput. 27(1), 15–23 (2003)

Accurate Acoustic Based Gesture Classification with Zero Start-Up Cost

Haojun Ai1,2,3, Liangliang Han4, Yifeng Wang1(B), and Liang Liao5,6

1 School of Cyber Science and Engineering, Wuhan University, Wuhan, Hubei, China {aihj,whuyifeng}@whu.edu.cn
2 Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Beijing, China
3 Collaborative Innovation Center of Geospatial Technology, Wuhan, China
4 Aerospace System Engineering Shanghai, Shanghai, People's Republic of China
5 ChangZhou Municipal Public Security Bureau, Changzhou, China
6 Key Laboratory of Police Geographic Information Technology, Ministry of Public Security, Beijing, China

Abstract. Acoustic gesture recognition based on the Doppler effect has garnered much research attention. The accuracy of gesture recognition and potential false positives are the main factors that limit the widespread use of gestures. To this end, we propose a novel gesture classification method based on the acoustic Doppler effect that does not require any custom hardware, simply a speaker and one microphone on a laptop. An effective sound field is built by a high-frequency sound wave from the speaker, and the wave reflected by hand motion is captured by the microphone. We design a set of five features; three of them are stable and invariant across different people, so even new users can operate our system with zero start-up cost and no training. The remaining two features are highly correlated with the velocity of a gesture and its distance from the computer, which can reduce potential false positives in detection. Besides, a classifier based on multistage decision rules is designed to identify the 11 kinds of defined gestures. The experimental results on user-experience feedback for HCI show that our system has good usability. And the numerical experiments with 10 users show that our system not only keeps potential false positives low, but also achieves a classification accuracy of up to 99.09%.

Keywords: Doppler effect · Gesture classification · Acoustic · HCI

1 Introduction

For years, device-free gesture recognition [2,9,10,17] has developed rapidly. In particular, widely used mobile phones and PCs have audio input and output components composed of speakers and microphones, so Doppler-based gesture recognition, as a new human-machine interface, has attracted the attention of researchers [1,7,8,14,15].

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 44–58, 2018. https://doi.org/10.1007/978-3-030-05057-3_4


Many studies have tried to use machine learning methods [11,13,19–21] to improve the accuracy of gesture recognition. For example, Ai et al. [1] obtained an HMM model of each gesture by training on the feature vectors of the samples, finally achieving a recognition accuracy of 95% for 18 gestures. Dolphin [16] extracted the effective frequency bins around the peak and normalized them to form a feature vector; the classifier they chose was Liblinear (a large linear classifier), with 93% recognition accuracy. In addition, classifiers such as neural networks [11] and Bayes [19] were also used in some studies. Although the classification accuracy of gestures is significantly improved by adopting machine learning, it causes other problems such as increased computational complexity and time consumption. Besides, potential false positives in gesture detection are also a key issue that restricts the widespread use of gestures in HCI. Most acoustic-based hand gesture classification methods show good robustness in an unmanned environment [1,14,16], but if people walk around, they are prone to false positives in detection [1,7,16]. In this paper, we extract three generally stable, invariant features to characterize a gesture, and two other features that reduce false positives when detecting gestures. Furthermore, we design a classifier based on multistage decision rules to categorize the 11 predefined gestures with high accuracy and fewer false positives. We summarize the main contributions of this paper as follows:

– We extract five features from the Doppler shift to characterize a gesture, and design a classifier based on multistage decision rules to identify all gestures, which keeps a high precision during gesture recognition.
– Two of the features are the bandwidth and amplitude of the shift, which significantly reflect the velocity of a gesture and its distance from the computer. Setting thresholds on them can effectively identify far-range walking and slow motions of people, thereby reducing potential false positives in detection.
– The remaining three features are the direction, the count of direction changes, and the distance. They are all stable, invariant properties of a gesture, so they generally do not change when the same gesture is performed by different people. Hence users can operate our system with zero start-up cost and no training.

2 Feature Extraction

The theoretical basis of the gesture identification is a well-known phenomenon: the Doppler effect [18]. When a moving object approaches the signal source (the speaker), the frequency of the signal perceived by the receiver (the microphone) becomes larger [3], whereas the perceived frequency decreases when the object moves away from the wave source.


The Doppler shift fR caused by a movement can be calculated by the equations:

    fR = (1 + Δv / c) × f    (1)

    Δf = fR − fS    (2)

Where Δv and c respectively represent the velocity of the object and the velocity of sound in air, and fS is the pilot tone transmitted from the speaker. Since the speaker and microphone remain stationary and are located on the same laptop, the velocities of the receiver and the source are not considered.

2.1 Signal Analysis

In this paper, a range of effective sound field is formed by a high-frequency signal of 18 kHz from the speaker. When the operator moves a hand in it, the reflected frequency shift is captured by the microphone. According to the characteristics of the Doppler frequency shift [6], the whole signal processing is carried out in the frequency domain. We set the sampling frequency of the microphone to 44.1 kHz, and then a 2048-point FFT is performed to obtain the frequency-domain characteristic of the sound. In the informal test of SoundWave [8], the fastest gesture reached 3.9 m/s. Herein, we conservatively estimate the fastest speed as 6 m/s; that is, the maximum frequency shift Δfmax = 318 Hz is calculated according to Eq. 1, so the left effective frequency range of the emitted peak is [17682, 18000] Hz and the right effective range is [18000, 18318] Hz.
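The effective-range calculation above can be sketched as follows (our own illustration, not the authors' code; the speed of sound c = 340 m/s is an assumed round value):

```python
# Reproduce the effective-range calculation from Eqs. (1)-(2).
C_SOUND = 340.0      # m/s, speed of sound in air (assumed)
F_PILOT = 18000.0    # Hz, emitted pilot tone
V_MAX = 6.0          # m/s, conservative estimate of the fastest gesture

f_received = (1 + V_MAX / C_SOUND) * F_PILOT   # Eq. (1)
delta_f_max = f_received - F_PILOT             # Eq. (2)

# effective search bands on each side of the pilot peak
left_range = (F_PILOT - round(delta_f_max), F_PILOT)
right_range = (F_PILOT, F_PILOT + round(delta_f_max))
print(round(delta_f_max))  # prints 318
```

This reproduces the paper's Δfmax = 318 Hz and the two effective ranges [17682, 18000] Hz and [18000, 18318] Hz.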

Fig. 1. (a) Positive shift in the frequency spectrum generated by a towards-gesture. (b) Time-frequency map caused by a moving hand. The hand moves towards and away from the device alternately from the 4th to the 8th second, with no motion in the remaining time.

We set the length of the analysis window to 50 ms, so the frequency-domain data are refreshed every 50 ms. The frequency spectrum is like a micro-image, reflecting


the instantaneous changes in the frequency of gestures, and contains many tiny details (Fig. 1(a)). A time-frequency graph is generated by adding time information to the spectrum; as seen in Fig. 1(b), it expresses the direction and distance of gestures at the macro level.

2.2 Feature Extraction

After getting the spectrum of the signal collected by the microphone, we extract five features (the bandwidth and amplitude of the frequency shift, the direction, the count of direction changes, and the moving distance of a gesture) to form a feature vector x:

    x = [x(1), x(2), ..., x(i), ..., x(n)]^T    (3)

Where x(i) represents the ith feature, and n = 5. The overall flow is shown in Fig. 2. Next, we explain each feature of the frequency shift in detail.

Bandwidth (x(1)). x(1) is the bandwidth of the emitted peak, obtained by scanning the frequency bins at 30% of the tone amplitude; it is extracted with the same method as SoundWave [8]. x(1) is a measure of the absolute velocity of the gesture movement and divides the hand velocity into different levels (Fig. 3). By setting an appropriate threshold θv, false positives caused by unintended slow motions of users can be effectively filtered out.
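The bandwidth scan can be sketched as follows (our own illustration of the SoundWave-style method summarized above, not the authors' implementation; the toy spectrum values are made up):

```python
# Walk outward from the pilot peak and count contiguous FFT bins whose
# amplitude stays above `ratio` (30% in the paper) of the peak amplitude.
# With a 2048-point FFT at 44.1 kHz, one bin is about 21.5 Hz wide.
def bandwidth_at_threshold(spectrum, peak_idx, ratio=0.3):
    thresh = ratio * spectrum[peak_idx]
    lo = peak_idx
    while lo > 0 and spectrum[lo - 1] >= thresh:
        lo -= 1
    hi = peak_idx
    while hi < len(spectrum) - 1 and spectrum[hi + 1] >= thresh:
        hi += 1
    return hi - lo + 1  # width x(1) in frequency bins

# toy spectrum: a pilot peak widened by a fast hand motion
spec = [0.0, 0.1, 0.4, 0.9, 1.0, 0.8, 0.35, 0.1, 0.0]
print(bandwidth_at_threshold(spec, peak_idx=4))  # prints 5
```

A faster hand produces a larger Doppler spread, so the same scan at the same 30% threshold returns more bins.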

Fig. 2. The processing flow of sound signal: capture ultrasound → Hamming window → FFT → extract bandwidth → determine direction → calculate distance.

Fig. 3. Bandwidth in frequency spectrum caused by diﬀerent velocity gestures.


Amplitude of Frequency Shift (x(2)). x(2) is the highest amplitude that the frequency shift reaches, expressed as a percentage of the amplitude of the tone peak Apeak. Shifts caused by performing the same gesture at far and near range are significantly different, mainly manifested in x(2), as illustrated in Fig. 4. The farther a gesture is performed, the lower x(2) is. Therefore, by setting a high amplitude threshold hupper, gestures can basically be divided into two

Fig. 4. L is the distance from the location of the gesture to the computer, and V represents the velocity level of a gesture. The noise in the surrounding environment is about 45 dB. (a) No gesture was performed. (b–c) High-amplitude shift caused by a gesture performed at near range, but the bandwidth x(1) in (c) is much larger than that in (b). (d–e) Lower-amplitude shift caused by a fast gesture performed far from the computer.


categories: the near-range gesture Gnear and the far-range gesture Gfar. In this paper, we set hupper = 70% × Apeak; it is obvious that x(2) > hupper in the frequency spectrum of Gnear, but x(2) < hupper for Gfar. To summarize, x(1) is the bandwidth covered by the frequency shift on the horizontal axis at a specific amplitude, which reflects the gesture velocity. x(2) is the amplitude that the frequency shift reaches on the vertical axis, so gestures can be simply divided into two categories based on the location of the gesture relative to the computer. Identifying a slow or far-range motion as a false alarm improves system robustness.

Direction (x(3)). x(3) represents the direction of the gesture, which depends on the energy difference between the right and left sides of the peak. When the frequency shift is positive, the energy on the right of the peak increases, whereas a negative shift causes the energy on the left side to increase.

Fig. 5. (a) The red line area shows a positive shift occurs on the right of pilot peak, x(3) > 0, meaning a towards-gesture. (b) No movement and no frequency shift, x(3) is near zero. (Color ﬁgure online)

Define the energy on the left, Eleft, as the integral of the spectrum over the left effective range:

    Eleft = ∫ from fS−Δfmax to fS of f(x) dx    (4)

Similarly, define the right energy Eright:

    Eright = ∫ from fS to fS+Δfmax of f(x) dx    (5)

Therefore, x(3) is the difference between the right and left energy:

    x(3) = Eright − Eleft    (6)


Where Δfmax = 318 Hz and f(x) is the amplitude of the shift at each effective frequency bin. As illustrated in Fig. 5(a), if x(3) is positive, the hand moves towards the device; a negative value means it moves away. No movement occurred if x(3) is near zero (Fig. 5(b)).
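The direction feature of Eqs. (4)-(6) can be sketched as a discrete sum over FFT bins (our own illustration; the bin frequencies and amplitudes below are toy values):

```python
# x(3): energy over the right effective band minus energy over the left.
F_PILOT, DF_MAX = 18000.0, 318.0

def direction_feature(bins):
    """bins: list of (frequency_hz, amplitude) pairs from the spectrum."""
    e_left = sum(a for f, a in bins if F_PILOT - DF_MAX <= f < F_PILOT)
    e_right = sum(a for f, a in bins if F_PILOT < f <= F_PILOT + DF_MAX)
    return e_right - e_left  # Eq. (6)

# toy towards-gesture: more energy on the right side of the pilot tone
bins = [(17800.0, 0.2), (17900.0, 0.2), (18100.0, 1.0), (18200.0, 0.9)]
x3 = direction_feature(bins)
print(x3 > 0)  # prints True: hand moving towards the device
```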

Fig. 6. When the frequency shift property goes from (1) to (2), one change of hand direction is detected. Similarly, going from (2) to (3) is also one change.

Fig. 7. The area (2) of a long distance gesture is obviously larger than the area (1) of a short gesture.

Count of Direction Change (x(4)). When the sign of x(3) changes within a gesture, one change of the gesture direction is recorded. In Fig. 6, the count of changes of motion direction x(4) is 5; that is, each crossing of the frequency shift over the peak marks a change.

Distance (x(5)). x(5) is calculated by the integration of the frequency shift over time, which indicates the moving distance of a gesture within one direction change, to distinguish long and short distance gestures (Fig. 7). Distance = time × velocity; time information can be quickly obtained from the time-frequency map, so the key is velocity. There is a proportional relationship between velocity and frequency shift based on Eq. 2. We use the following equation to make a rough calculation of x(5):

    x(5) = (c / fS) × ∫ from t1 to t2 of Δf dt    (7)

Where t1 and t2 respectively represent the start point and the next direction-change point of the gesture within one change of the gesture direction. In Fig. 8, an informal test shows the x(5) distribution of different gestures, where short and long distance gestures were each performed 100 times by 10 participants. The result verified our expectation that long and short distance gestures have a clear boundary value. So we initially set the threshold DL/S between long and short distance to 500, to make the two types of gestures clearly distinct and to ensure high sensitivity in distinguishing them.
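Equation (7) can be approximated with one Doppler-shift sample per 50 ms analysis window (a rough sketch of our own; c = 340 m/s is an assumed value and the shift sequence is made up — note that the paper's threshold DL/S = 500 suggests the implementation may also work with the unscaled shift integral in Hz):

```python
# x(5): c/fS times the time integral of the Doppler shift, here a sum
# over 50 ms analysis windows.
C_SOUND, F_PILOT, WINDOW = 340.0, 18000.0, 0.05

def distance_feature(shifts_hz):
    return (C_SOUND / F_PILOT) * sum(shifts_hz) * WINDOW

# hand moving at ~2 m/s (Delta-f ~ 106 Hz) for six 50 ms windows (0.3 s)
x5 = distance_feature([106.0] * 6)
print(round(x5, 2))  # prints 0.6, i.e. roughly 2 m/s * 0.3 s in metres
```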


Fig. 8. Histogram of x(5) distribution.

Fig. 9. G1 ∼ G5 graphic representation: Towards and Away (long distance), Tap-Towards and Tap-Away (short distance), and Both-Handed-Seesaw.

3 Gesture Definition and Classification

The designed gestures are not only accessible body language for HCI, but are also easily discriminated from each other.

3.1 Gesture Definition

Based on the proposed five features, we can define a simple set that contains 11 gesture actions: G = {G1, G2, ..., Gj, ..., GN}, where N = 11 and Gj represents the jth gesture in the set. All gesture descriptions are listed in Table 1, and Fig. 9 shows the motion graphics of G1 ∼ G5, where G1 and G2 are long distance gestures, while the tap gestures, which resemble mouse clicks, are all short distance motions. The remaining gestures G6 ∼ G11 are compound gestures. Users need to perform gestures at a certain velocity, but a constant velocity is not required; the instantaneous velocity only needs to reach a certain threshold. Users can adjust the velocity threshold according to their own habits.

3.2 Hand Gesture Classification

In this section, we classify gestures step by step based on diﬀerent features until we categorize each of the gestures. The system ﬁrst detects G5 (BHS) because

Table 1. Definition of gestures

Number | Gesture | Description
G1     | T       | Towards: Move hand towards the microphone for long distance
G2     | A       | Away: Move hand away from the devices for long distance
G3     | TT      | Tap-Towards: Swipe hand towards then away, just like clicking a mouse one time, short and quick
G4     | TA      | Tap-Away: Same action as G3, in the opposite direction
G5     | BHS     | Both-Handed-Seesaw: Move both hands from two sides to the middle simultaneously, and then separate
G6     | TtA     | Towards-then-away: Swipe hand towards for long distance, then away to origin
G7     | AtT     | Away-then-towards: Same gesture as G6, only in the opposite direction
G8     | DTT     | Double-Tap-Towards: Do G3 twice
G9     | DTA     | Double-Tap-Away: Perform G4 twice
G10    | TTT     | Triple-Tap-Towards: Perform G3 three times
G11    | TTA     | Triple-Tap-Away: Do G4 three times

it causes significant shifts on both sides of the tone peak simultaneously, a clear distinction from the remaining 10 gestures. Then, we classify the remaining 10 gestures using a classifier based on multistage decision rules (Fig. 10). Table 2 lists the feature values of the 10 gestures.
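The multistage rules implied by Table 2 can be sketched as follows (our own reconstruction for illustration, not the authors' code; the feature encoding is an assumption):

```python
# BHS is detected first from shifts on both sides of the peak; the rest
# are classified by direction (sign of x(3)), direction-change count
# x(4), and long/short distance x(5) ('L' or 'S').
def classify(direction, x4, x5, both_sided=False):
    if both_sided:
        return "BHS"
    towards = direction == "towards"
    if x4 == 0:
        return "T" if towards else "A"
    if x4 == 1 and x5 == "S":
        return "TT" if towards else "TA"
    if x4 == 1 and x5 == "L":
        return "TtA" if towards else "AtT"
    if x4 == 3:
        return "DTT" if towards else "DTA"
    if x4 == 5:
        return "TTT" if towards else "TTA"
    return None  # no defined gesture matched

print(classify("towards", 1, "L"))  # prints TtA
```

Each stage inspects one feature, so an input either terminates at a gesture label or falls through to the next rule.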

4 Evaluation and Results

We evaluated the system performance experimentally. The system was developed on a laptop PC with Windows 10 and a pair of microphone and speaker without any customized hardware (Fig. 11), so the direction of any gesture performed by the user is the same for the microphone and the speaker. Note that any gesture within 0.8 m, and people walking within 2 m of the computer, can cause significant frequency shifts; the experimental scene has a noise level of 45 dB.

4.1 Numerical Experiment

We conducted a numerical experiment to evaluate the robustness of the system through the following three measures: false positives, false negatives, and classification accuracy.


Fig. 10. Identifying gestures using a classifier: when a gesture is detected, the classifier adopts the features x(4), x(3), and x(5) in turn as the decision rule for each stage.

Fig. 11. Device deployment in the experimental environment.

Potential False Positive. A false alarm refers to a gesture being erroneously detected when no gesture was performed. Experiments were conducted in the following two common living environments. In the first, the user only sat in front of the computer for normal typing and thinking, while no one walked around. In half an hour, the number of potential false positives was 6, all of them single tap actions, since these gestures are short and simple. In the second case, the user performed no actions; three participants were located about 1.5 m from the computer and walked around for half an hour. The system finally detected 4 false positives, all of them the result of the participants walking quickly.

Table 2. The list of features for all gestures

Gestures | x(3)    | x(4) | x(5)
T        | Towards | 0    | Long distance (L)
A        | Away    | 0    | L
TT       | Towards | 1    | Short distance (S)
TA       | Away    | 1    | S
TtA      | Towards | 1    | L
AtT      | Away    | 1    | L
DTT      | Towards | 3    | S
DTA      | Away    | 3    | S
TTT      | Towards | 5    | S
TTA      | Away    | 5    | S

False Negative. A false negative means no gesture is detected while a deliberate gesture is actually performed. In our experiment, 10 users (marked U1 ∼ U10) participated and performed each gesture 100 times, resulting in 11000 (100 × 11 × 10) gesture samples. Several false negative errors occurred in the process, as shown in Fig. 12. The false negative rates of nine users are all less than 1%, but the rate of U2 is as high as 1.1% (Fig. 12). Why? We found an interesting pattern in the data: U2 tends to move four fingers in parallel instead of sliding the palm of the hand, resulting in a smaller frequency shift. This may be the reason for the high false negative rate.


Fig. 12. The rate of false negative during the gesture sample test process.

Fig. 13. The confusion matrix of the gesture classification (overall recognition rate = 99.09%).


Classification Accuracy. The effective gesture samples successfully detected in the above experiment were then used to measure classification precision. Since the samples were all labeled, we could easily calculate the final classification accuracy (Fig. 13), which reaches 99.09%. Several samples were misidentified, mostly because of occasional confusion in deciding the long or short distance of gestures: different people have their own preferences when performing hand gestures, so it is very difficult to classify gestures with 100% accuracy by choosing a single threshold DL/S for long and short distance. However, this does not mean the evaluation contradicts our claim, because the experimental results show that our method can already identify gestures of different distances with very high accuracy.

4.2 Gesture Usability Test

Research in gesture usability focuses on five main principles [4,5,12]: learnability, efficiency, memorability, errors, and coverage. Among them, the low error rate (99.09% accuracy) and coverage (zero start-up cost and no training) have been basically verified in Sect. 4.1. Next, we mapped the gesture set to a common remote controller, taking the MI smart TV remote controller as an example (Fig. 14). Each gesture operates one button; there are 11 buttons on the controller, corresponding to our 11 kinds of gestures. 10 users (U1 ∼ U10) performed gestures to simulate the remote controller and direct the MI TV freely. We collected a total of 151 gesture samples from the 10 users, of which 2 were missed detections and 1 was misidentified. We further recorded the user experience to evaluate the usability of the system for gesture classification. Each participant indicated that the system is particularly efficient, as they could smoothly operate the TV with high precision. Six participants remarked

Fig. 14. MI TV remote controller.


speciﬁcally on the’learnability’, since they were asked to observe the demo and learn gestures for 2–3 min and then operate the TV. Besides, eight participants described the gestures as “memorability” and “learnability”, since the meaning of the gestures are easy to understand, so they can remember them (and perform them) easily. However, two participants acknowledged that the gesture action and the function of the menu are not very relevant, increasing the memory burden. Finally, our method shows better performance in many items (Table 3) by comparison with the state of the art. A computer with one speaker and a microphone can meet our all hardware requirements. In addition, all experiments do not require users to perform gesture samples in advance and no training. Meanwhile, the results of digital experiments have veriﬁed that our system is robust. It not only has less potential false positives, but also can keep the false negative rate within 1%, and ﬁnally achieve about 99% classiﬁcation accuracy with the deﬁned 11 gestures. Table 3. Comparison to the existing sound-based methods Methods

SoundWave [8] Dolphin [16] Multiwave [15] Our method

Number of speakers

1

1

≥2

1

Needing training?

NO

YES

YES

NO

Improve false positives? YES

NO

NO

YES (>SoundWave)

Test false negatives?

NO

NO

NO

YES

Accuracy

94.5%

93%

93.9%

99%

5 Conclusion

In this paper, we proposed a gesture set for HCI based on the Doppler effect. The sound field is created by a pair consisting of a speaker and a microphone; the signal reflected by the moving hand is captured by the microphone. We extract the five most robust features from the Doppler shift, and classify a gesture set containing 11 gestures with a classifier based on multistage decision rules. Compared with the state of the art, the features we propose better mitigate potential false positives, and our method achieves high accuracy when classifying all gestures with no training. Finally, the results of the experiments illustrate that our gesture set performs very well on usability, including high accuracy, fewer false positives, learnability, memorability, and zero start-up cost.

Acknowledgment. We thank the participants for taking part in the user study. This work is partially supported by The National Key Research and Development Program of China (2016YFB0502201).


References

1. Ai, H., Men, Y., Han, L., Li, Z., Liu, M.: High precision gesture sensing via quantitative characterization of the Doppler effect. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 973–978. IEEE (2016)
2. Asadzadeh, P., Kulik, L., Tanin, E.: Gesture recognition using RFID technology. Pers. Ubiquit. Comput. 16(3), 225–234 (2012)
3. Aumi, M.T.I., Gupta, S., Goel, M., Larson, E., Patel, S.: DopLink: using the Doppler effect for multi-device interaction. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 583–586. ACM (2013)
4. Bevan, N., Curson, I.: Methods for measuring usability. In: Howard, S., Hammond, J., Lindgaard, G. (eds.) Human-Computer Interaction INTERACT 1997. ITIFIP, pp. 672–673. Springer, Boston, MA (1997). https://doi.org/10.1007/978-0-387-35175-9_126
5. Cabral, M.C., Morimoto, C.H., Zuffo, M.K.: On the usability of gesture interfaces in virtual reality environments. In: Proceedings of the 2005 Latin American Conference on Human-Computer Interaction, pp. 100–108. ACM (2005)
6. Chen, K.Y., Ashbrook, D., Goel, M., Lee, S.H., Patel, S.: AirLink: sharing files between multiple devices using in-air gestures. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 565–569. ACM (2014)
7. Fu, B., Karolus, J., Grosse-Puppendahl, T., Hermann, J., Kuijper, A.: Opportunities for activity recognition using ultrasound Doppler sensing on unmodified mobile phones. In: Proceedings of the 2nd International Workshop on Sensor-based Activity Recognition and Interaction, p. 8. ACM (2015)
8. Gupta, S., Morris, D., Patel, S., Tan, D.: SoundWave: using the Doppler effect to sense gestures. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1911–1914. ACM (2012)
9. Jeong, J., Jang, Y.: Max-min hand cropping method for robust hand region extraction in the image-based hand gesture recognition. Soft Comput. 19(4), 815–818 (2015)
10. Kellogg, B., Talla, V., Gollakota, S.: Bringing gesture recognition to all devices. NSDI 14, 303–316 (2014)
11. Molchanov, P., Gupta, S., Kim, K., Kautz, J.: Hand gesture recognition with 3D convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–7 (2015)
12. Nielsen, M., Störring, M., Moeslund, T.B., Granum, E.: A procedure for developing intuitive and ergonomic gesture interfaces for HCI. In: Camurri, A., Volpe, G. (eds.) GW 2003. LNCS (LNAI), vol. 2915, pp. 409–420. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24598-8_38
13. Paramonov, P., Sutula, N.: Simplified scoring methods for HMM-based speech recognition. Soft Comput. 20(9), 3455–3460 (2016)
14. Pittman, C., Wisniewski, P., Brooks, C., LaViola Jr., J.J.: Multiwave: Doppler effect based gesture recognition in multiple dimensions. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 1729–1736. ACM (2016)
15. Pittman, C.R., LaViola Jr., J.J.: Multiwave: complex hand gesture recognition using the Doppler effect. In: Proceedings of the 43rd Graphics Interface Conference, pp. 97–106. Canadian Human-Computer Communications Society (2017)


16. Qifan, Y., Hao, T., Xuebing, Z., Yin, L., Sanfeng, Z.: Dolphin: ultrasonic-based gesture recognition on smartphone platform. In: 2014 IEEE 17th International Conference on Computational Science and Engineering (CSE), pp. 1461–1468. IEEE (2014)
17. Rautaray, S.S., Agrawal, A.: Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev. 43(1), 1–54 (2015)
18. Seddon, N., Bearpark, T.: Observation of the inverse Doppler effect. Science 302(5650), 1537–1540 (2003)
19. Suk, H.I., Sin, B.K., Lee, S.W.: Hand gesture recognition based on dynamic Bayesian network framework. Pattern Recogn. 43(9), 3059–3072 (2010)
20. Xiao, Q., Siqi, L.: Motion retrieval based on dynamic Bayesian network and canonical time warping. Soft Comput. 21(1), 267–280 (2017)
21. Xiao, Q., Song, R.: Motion retrieval based on motion semantic dictionary and HMM inference. Soft Comput. 21(1), 255–265 (2017)

An Approach of Collecting Performance Anomaly Dataset for NFV Infrastructure

Qingfeng Du1,2, Yu He1,2(B), Tiandi Xie1,2, Kanglin Yin1,2, and Juan Qiu1,2

1 School of Software Engineering, Tongji University, Shanghai, China {du cloud,rainlf,xietiandi,14 ykl,Juan qiu}@tongji.edu.cn
2 Software Engineering R&D Centre, Tongji University, Jishi Building, Shanghai, China https://github.com/XLab-Tongji

Abstract. Network Function Virtualization (NFV) technology is widely used in industry and academia. Meanwhile, it brings many challenges to the reliability of NFV applications, such as anomaly detection, anomaly localization, and anomaly prediction. All of these studies need a large amount of anomaly data. This paper designs a method for collecting anomaly data from Infrastructure as a Service (IaaS) and constructs an anomaly database for NFV applications. Three types of anomaly datasets are created for anomaly study: datasets of workload with performance data, fault load with performance data, and violations of Service Level Agreements (SLA) with performance data. To better simulate anomalies in a production environment, we use Kubernetes to build a distributed environment, and to accelerate the occurrence of anomalies, a fault injection system is utilized. Our aim is to provide more valuable anomaly data for reliability research in NFV environments.

Keywords: Anomaly database · NFV · Kubernetes · IaaS · Clearwater · Performance monitoring · Fault injection

1 Introduction

Network Function Virtualization (NFV) is becoming more and more popular. Many Communication Service Providers (CSPs) have begun to migrate applications to the NFV environment [1]. Anomaly detection and anomaly localization are very important for providing better network services, and in some special circumstances it is necessary to predict anomalies. This requires analyzing the rules and connections in a large amount of anomaly data. But in a production environment, the cost of collecting these data is high, so it is meaningful to collect anomaly data for research in an experimental environment.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 59–71, 2018. https://doi.org/10.1007/978-3-030-05057-3_5


At present, there are many databases of anomaly data, such as the KDD CUP 99 dataset1, the NAB dataset2, the Yahoo Webscope S5 dataset3, and so on. Any of these can serve as a benchmark for evaluating anomaly detection algorithms, but these datasets also have some restrictions, such as single labels and data redundancy. On this basis, we collect anomaly data from three different perspectives. In an NFV environment, a failure does not have a single cause; in order to describe different exceptions more accurately, multiple types of fault tags are necessary. Our method uses a fault injection system to specify the fault types of anomaly data, making the datasets more suitable for multi-class classification problems in machine learning [2]. In addition, malfunctions of system resources can lead to system anomalies, and the pressure of user workload on the system can also lead to anomalous system behavior [3]. In a production environment, an increase in users may be a more important factor leading to anomalous service than the occurrence of hardware anomaly events, so our method also collects anomaly data under different workloads. In NFV applications, the typical quality-of-service index is the Service Level Agreement (SLA)4; when an SLA violation occurs, it represents an anomalous service. Our method also collects performance data under different SLA levels, which helps researchers analyze the relationship between the occurrence of an SLA violation and the IaaS performance data of a system. At last, we propose several machine learning models based on supervised learning to detect SLA violations of VNFs and anomalies in IaaS, and compare the experimental results of the models. The comparison shows that our anomaly database has reference value for anomaly detection in VNF environments. The paper is organized as follows: Sect. 2 introduces the technical background and our related work in the construction of the anomaly database. Section 3 introduces the architecture of the data collection. Section 4 shows the implementation of our experiment. Section 5 provides a classical case study of the Clearwater project5 and gives a detailed description of the building of the anomaly database. At last, we summarize the contributions and discuss future work in Sect. 6.

2 Background and Related Work

With the development of Internet applications and the maturity of hardware virtualization, the emergence of Infrastructure as a Service (IaaS) [4] provides the underlying hardware support for this architecture. It means network providers do not need to care about the details of the underlying hardware devices and can

1. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
2. https://github.com/numenta/NAB
3. https://webscope.sandbox.yahoo.com/catalog.php?datatype=s
4. https://en.wikipedia.org/wiki/Service-level_agreement
5. http://www.projectclearwater.org/

An Approach of Collecting Performance Anomaly Dataset

61

concentrate on providing upper-level services. In this context, Virtual Network Functions (VNFs) represent any virtual execution environment configured to provide a given network service. VNFs are often structured in several components, each one hosted on a single VM. The existing anomaly databases collect a lot of anomaly data in different fields. The KDD CUP 99 dataset is used for network attack diagnosis: each of its records indicates whether or not an attack occurred at that moment, which means the dataset has only one label, normal or anomaly. Even though Mahbod Tavallaee and his collaborators further optimized the KDD CUP 99 dataset into NSL-KDD, it still has the same limitation [5]. This paper provides a disturbance system to specify the type of fault load, in order to analyze the influence of different fault types on the performance of the tested system. Markus Thill presents a comparative study in which several online anomaly detection algorithms are compared on the large Yahoo Webscope S5 anomaly benchmark [6]. But the Yahoo Webscope S5 dataset is more suitable for time series analysis, and it still has limitations for the classification of different faults. We present a new approach to collecting performance data with fault labels, which has more advantages for the classification problem in anomaly detection. In this paper, we integrate the common single-fault time series analysis problem and the multiple-fault classification problem of complex systems, propose a corresponding performance data collection system and disturbance system, and then establish varied datasets in our anomaly database, providing a reference for fault analysis in different scenes. The details are shown on our site6.

3 Architecture of Data Collection

This section outlines the framework of our performance data collection. In order to accurately collect data with a fault type label, the framework consists of three systems: the target application system (target system), the disturbance system and the performance monitoring system (monitoring system), as shown in Fig. 1.

3.1 Target System

The target system is an NFV application system, i.e. a software implementation of network functions that can be deployed on a network functions virtualization infrastructure (NFVI). The NFVI is the totality of all hardware and software components that build the environment where VNFs are deployed.

3.2 Disturbance System

The core function of the disturbance system is fault injection [7,8]. It is used to accelerate the occurrence of anomaly events in the target system, such as

6. https://github.com/XLab-Tongji


Fig. 1. Architecture of the performance data collection

hardware performance bottlenecks, SLA violations and so on. In this paper, we use the Linux stress tool stress-ng [9] to simulate system pressure and thereby implement the fault injection function. In order to produce different types of disturbance, we use three types of fault injection in the target system:

– CPU stress fault
– MEMORY stress fault
– IO stress fault

Every type of fault injection consumes the corresponding system resource as much as possible to ensure the occurrence of anomaly events. In most situations, anomaly diagnosis of platforms or systems is directed against single-point failures [10], so we use a strategy that ensures only one type of disturbance occurs on only one virtual machine at the same time. When fault injection occurs, the disturbance system records the log of the fault injection, including the start time, the duration, the type of fault and the target virtual machine. After the monitoring system collects the performance data, these logs can be used to tag it.
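The three fault types can be mapped onto concrete stress-ng invocations. The sketch below is illustrative only: --cpu, --vm, --vm-bytes, --hdd and --timeout are standard stress-ng options, but the worker counts and sizes here are assumptions, not the paper's actual configuration.

```python
# Sketch: build a stress-ng command line for one single-point fault.
# Worker counts and memory sizes are illustrative assumptions.

def stress_command(fault_type, duration_s):
    """Return the argv list for injecting one fault for duration_s seconds."""
    base = ["stress-ng", "--timeout", f"{duration_s}s"]
    if fault_type == "cpu":
        return base + ["--cpu", "0"]   # 0 = one worker per online CPU
    if fault_type == "memory":
        return base + ["--vm", "2", "--vm-bytes", "80%"]
    if fault_type == "io":
        return base + ["--hdd", "2"]
    raise ValueError(f"unknown fault type: {fault_type}")

# The disturbance system would run this on the target VM, e.g.:
#   subprocess.run(stress_command("cpu", 60))
cmd = stress_command("memory", 120)
```

In a real deployment the attacker host would ship such a command to the target VM (e.g. over SSH) and log the start time, duration, fault type and target for later labelling.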

3.3 Monitoring System

There are many mature IaaS-layer monitoring schemes at present, like Zabbix7, Nagios8 and Cacti9. Considering our experimental environment and

7. https://www.zabbix.com/
8. https://www.nagios.org/
9. https://www.cacti.net/


monitoring project items, we use Zabbix to monitor the system and collect performance data online. Zabbix is an enterprise open source monitoring software for networks and applications with a C/S model; the Zabbix agent is installed in the VMs. Experience shows that agent monitoring is more accurate than agent-less monitoring and can more accurately describe the performance model of a system [11]. Table 1 shows the performance model in our approach. The Zabbix agents collect these metrics from the VMs and store them in the server's MySQL database. We also offer a JAVA application to download these performance data through the RESTful API of the Zabbix server.

Table 1. Zabbix monitoring metrics
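The paper's collector is a JAVA application; as a hedged illustration in Python, the JSON-RPC requests that a Zabbix server accepts on api_jsonrpc.php could be built as below. user.login and history.get are documented Zabbix API methods; the item id and time range are made up for the example.

```python
import json

# Sketch of the two JSON-RPC payloads a collector would POST to
# http://<zabbix-server>/api_jsonrpc.php (hypothetical item id / times).

def login_request(user, password):
    # Note: older Zabbix versions use the key "user" instead of "username".
    return {"jsonrpc": "2.0", "method": "user.login",
            "params": {"username": user, "password": password}, "id": 1}

def history_request(auth_token, itemid, time_from, time_till):
    """Fetch the stored history of one monitored metric, oldest first."""
    return {"jsonrpc": "2.0", "method": "history.get",
            "params": {"itemids": [itemid],
                       "time_from": time_from, "time_till": time_till,
                       "output": "extend",
                       "sortfield": "clock", "sortorder": "ASC"},
            "auth": auth_token, "id": 2}

payload = json.dumps(history_request("token", "23296", 1521448560, 1521458230))
```

An HTTP client would send each payload with Content-Type application/json-rpc and read the "result" field of the response.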

4 Implementation

This section presents the implementation of our test bed environment. It includes the infrastructure, the Kubernetes platform, the monitoring system, the attacker system and the clearwater-docker NFV application running on the Kubernetes platform, as shown in Fig. 2.

4.1 Infrastructure

The virtualized platform is a VMWare ESXi machine with 64 CPUs, 128 GB memory and 2 TB disk, which can provide multiple virtual machines on a single physical machine. In this paper, we create 10 VMs on it. Every VM has 2 CPUs, 8 GB memory and 20 GB disk. The VMs are connected through a 1000 Mbps virtualized network and have the Docker environment (version 17.03.2-ce), so most Docker containers can be deployed on them.


4.2 Kubernetes

Kubernetes is a powerful container management platform. We use it to deploy the Clearwater project as described below. Here we use the Rancher scheme10 to deploy the Kubernetes platform on the VMs, because it makes the deployment easy. The installation steps are as follows:

1. Confirm that the network between the newly created virtual machines is working;
2. Select a host as the Rancher server host and deploy the latest version of the Rancher Docker image on it;
3. Wait until the Rancher server is running correctly, then access the Rancher server page on port 80 of the host;
4. Create a new environment for the test bed based on the Kubernetes template;
5. Add all other VMs to this environment and wait for the Rancher server to add them to the Kubernetes platform automatically.

Fig. 2. Deployment of the test bed

4.3 Monitoring and Attack System

The monitoring system consists of the Zabbix server host and the Zabbix agents. The Zabbix agents were installed on each VM when it was created and connect to the Zabbix server through the web page configuration. When the connection is set up, each agent begins to collect performance data and report it to the server at a set time interval. The attacker host is also an independent host; it executes the attack scripts which we provide to perform fault injection into the VMs.

10. https://rancher.com/

4.4 NFV Application

The NFV application is a distributed system running network functions. Here we utilise the Clearwater project, an open source implementation of an IMS (IP Multimedia Subsystem) for cloud platforms. It provides SIP-based (Session Initiation Protocol) voice and video calling and messaging applications, and implements the key standardized interfaces and functions of an IMS (except a core network), which enables industries to easily deploy, integrate and scale an IMS [3]. The Clearwater project is consequently well suited for NFV-related studies. It consists of about 10 components; every component plays its own unique function in the system, and the relationship between the components is shown in Fig. 3. Due to the Docker deployment scheme, every Clearwater Docker container is configured to allow unlimited use of host resources.

Fig. 3. Architecture of the Clearwater project

Bono (Edge Proxy): The Bono nodes form a horizontally scalable SIP edge proxy providing both a SIP IMS Gm compliant interface and a WebRTC interface to clients. Client connections are load balanced across the nodes. The Bono node provides the anchor point for the client's connection to the Clearwater system, including support for various NAT traversal mechanisms. A client is therefore anchored to a particular Bono node for the duration of its registration, but can move to another Bono node if the connection or client fails.

Sprout (SIP Router): The Sprout nodes act as a horizontally scalable, combined SIP registrar and authoritative routing proxy, and handle client


authentication and the ISC interface to application servers. The Sprout nodes also contain the built-in MMTEL application server.

Dime (Diameter Gateway): Dime nodes run Clearwater's Homestead and Ralf components. Homestead (HSS Cache) provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or pull the data from an IMS compliant HSS over the Cx interface. Ralf provides an HTTP API that both Bono and Sprout can use to report billable events that should be passed to the CDF (Charging Data Function) over the Rf billing interface.

Vellum (State Store): Vellum is used to maintain all long-lived state in the deployment. It does this by running a number of cloud-optimized, distributed storage clusters, including Cassandra, etcd, Chronos and Memcached.

Homer (XDMS): Homer is a standard XDMS used to store MMTEL service settings documents for each user of the system.

Ellis: Ellis is a sample provisioning portal providing self sign-up, password management, line management and control of MMTEL service settings.

As introduced before, Bono, Sprout and Homestead are the core modules of the Clearwater project; they work together to control sessions initiated by users, so our data collection work mainly focuses on these three modules. When the experiment begins, Clearwater runs normally to generate normal data, or runs overloaded to generate anomaly data. When the system is running normally, the attacker host can execute attacks to disturb the system, producing anomaly data and recording the log, while the monitoring system monitors the VMs' performance metrics and collects all normal and anomaly data to establish the database.

5 Case Study

This section introduces a classic Clearwater case study. On the basis of the normal operation of the system, we disturb it with overload stress and fault injection respectively to produce the anomaly datasets, and we select machine learning algorithms with good performance in anomaly detection [12–15] to verify the usability of the datasets. In order to produce a normal workload, we use the officially recommended tool clearwater-sip-stress-coreonly11. It controls the working stress of the system through three parameters:

– subscriber count: the number of subscribers to emulate;
– duration: the number of minutes to run stress for;
– multiplier: an optional multiplier for the VoLTE load profile (the default of 1 means 1.3 calls and 24 re-registers per subscriber per hour; passing 2 here means 2.6 calls and 48 re-registers per subscriber per hour).

11. https://clearwater.readthedocs.io/en/stable/Clearwater_stress_testing.html


We chose 500 subscribers, 60 min and a multiplier of 450 for the experiment. At this point, the system reaches a 100% successful call rate; when the work stress continues to increase, the successful call rate begins to decline. So we mark this point as an engineering level point x, meaning that the system is running at full workload under the current configuration.

5.1 Workload Module

As described above, we use the engineering level point x as the standard to produce workload, and test the performance data of the system under 0.8x, 1x, 1.5x, 2x and 2.5x pressure respectively. The structure of the collected dataset is shown in Table 2.

5.2 Faultload Module

In this paper, we focus on single-point faults, meaning that at any moment only one type of fault is injected into one VM. The 0.8x engineering level is chosen as the normal running workload so that the anomalous behavior generated by fault injection can be easily observed. The process of fault injection is shown in Fig. 4.

Fig. 4. Fault injection process

Within a specified time period, the fault injecting program selects a random fault type, a random target virtual machine and a random injection period to start a disturbance process. This process continues until the total time consumed by fault injection reaches the stipulated time period, as described in Algorithm 1. The disturbance system also records the injection log while injecting the fault; the key information includes the timestamp, the fault type, the target host and the injection duration. As Algorithm 2 describes, we use the fault injection log to indicate which fault injection stage each performance data record belongs to: normal, cpu fault, memory fault or io fault. The result of the data processing is shown in Table 3.

68

Q. Du et al.

Algorithm 1. Fault Inject Controller
Input: vm_list, inject_type_list, duration_list, duration
1: timer = 0
2: while timer < duration do
3:   inject_vm = random(vm_list)
4:   inject_type = random(inject_type_list)
5:   inject_duration = random(duration_list)
6:   timer += inject_duration
7:   inject(inject_vm, inject_type, inject_duration)
8:   sleep(pause)
9: end while
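Algorithm 1 can be sketched in runnable Python as below; the inject callback stands in for the real attack script, and the sleep between injections is elided.

```python
import random

def fault_inject_controller(vm_list, inject_type_list, duration_list,
                            duration, inject):
    """Sketch of Algorithm 1: repeatedly pick a random VM, fault type and
    injection period until the cumulative injection time reaches
    `duration`. Returns the injection log later used for labelling."""
    log = []
    timer = 0
    while timer < duration:
        inject_vm = random.choice(vm_list)
        inject_type = random.choice(inject_type_list)
        inject_duration = random.choice(duration_list)
        timer += inject_duration
        inject(inject_vm, inject_type, inject_duration)  # e.g. run stress-ng remotely
        log.append((inject_vm, inject_type, inject_duration))
        # a real controller would also sleep(pause) between injections
    return log

log = fault_inject_controller(["vm1", "vm2"], ["cpu", "memory", "io"],
                              [30, 60], 300, inject=lambda *args: None)
```

The returned log plays the role of the injection log in Algorithm 2: each entry records which fault ran where and for how long.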

In order to collect the anomaly SLA data, the workload module and the faultload module work together to disturb the system. We calculate the SLA level of the system from the percentage of successful requests (PSR). When PSR ≥ 90%, the system is in good condition, marked as level 2. When 50% ≤ PSR < 90%, the system is in unhealthy condition, marked as level 1. When PSR < 50%, the system is in bad condition, marked as level 0. The structure of the dataset is shown in Table 4.

Table 2. Dataset A

Timestamp | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | Workload level
1521448560 | 70% | 73% | ... | 69% | 77% | ... | 66% | 69% | ... | 1
1521448565 | 73% | 73% | ... | 68% | 75% | ... | 70% | 74% | ... | 1
...
1521458230 | 98% | 99% | ... | 97% | 100% | ... | 95% | 97% | ... | 2
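The PSR thresholds above translate directly into a labelling function; a minimal sketch, taking PSR as a fraction in [0, 1]:

```python
def sla_level(psr):
    """Map the percentage of successful requests (PSR) to the SLA level
    used in Dataset C: 2 = good, 1 = unhealthy, 0 = bad condition."""
    if psr >= 0.9:
        return 2
    if psr >= 0.5:
        return 1
    return 0

levels = [sla_level(p) for p in (0.95, 0.7, 0.3)]
```

Each performance record collected during the combined workload/faultload run gets the level computed from the PSR measured over the same interval.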

5.3 Dataset Verification

This part introduces four widely used machine learning algorithms, namely support vector machines, nearest neighbors, naive Bayes and random forests, and uses them to locate outliers in the system performance data.

Algorithm 2. Data Labeled Controller
Input: performance_data, injection_log
1: labeled_data = []
2: while performance_data.has_next() do
3:   data = performance_data.next()
4:   data_label = label(data, injection_log)
5:   labeled_data.append(data_label)
6: end while
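The label() step of Algorithm 2 matches each record's timestamp against the injection windows. A hedged Python sketch, assuming a (start, duration, fault_type) log format for illustration:

```python
def label(record_ts, injection_log):
    """Return the fault-injection stage a record belongs to: the fault
    type if the timestamp falls inside an injection window, else 'normal'.
    The (start, duration, fault_type) log layout is an assumption."""
    for start, duration, fault_type in injection_log:
        if start <= record_ts < start + duration:
            return fault_type
    return "normal"

def label_dataset(performance_data, injection_log):
    # performance_data: iterable of (timestamp, metrics) pairs
    return [(ts, metrics, label(ts, injection_log))
            for ts, metrics in performance_data]

injections = [(100, 30, "cpu"), (200, 60, "io")]
labeled = label_dataset([(90, {}), (110, {}), (230, {})], injections)
```

Because the single-point-fault strategy guarantees non-overlapping injections, the first matching window is always the right one.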


Table 3. Dataset B

Timestamp | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | Normal | CPU | MEMORY | IO
152263940 | 70% | 73% | ... | 69% | 77% | ... | 66% | 69% | ... | 1 | 0 | 0 | 0
152263945 | 73% | 73% | ... | 68% | 75% | ... | 70% | 74% | ... | 1 | 0 | 0 | 0
152263950 | 73% | 100% | ... | 69% | 79% | ... | 72% | 73% | ... | 0 | 1 | 0 | 0
...
152267680 | 71% | 74% | ... | 70% | 75% | ... | 99% | 72% | ... | 0 | 0 | 0 | 1

Table 4. Dataset C

Timestamp | Vm1metric1 | Vm1metric2 | ... | Vm2metric1 | Vm2metric2 | ... | Vm3metric1 | Vm3metric2 | ... | SLA level
1521448560 | 90% | 72% | ... | 92% | 74% | ... | 85% | 91% | ... | 2
1521448565 | 85% | 77% | ... | 83% | 75% | ... | 73% | 88% | ... | 1
...
1521458230 | 66% | 68% | ... | 92% | 89% | ... | 87% | 79% | ... | 0

Table 5. Validation results of anomaly dataset

Dataset | Measure | Nearest neighbors | SVM | Naive Bayes | Random forest
Dataset A | Precision | 0.98 | 0.89 | 0.95 | 0.97
Dataset A | Recall | 0.97 | 0.88 | 0.93 | 0.96
Dataset A | F1-score | 0.97 | 0.87 | 0.93 | 0.98
Dataset B | Precision | 0.93 | 0.90 | 0.96 | 0.99
Dataset B | Recall | 0.92 | 0.91 | 0.95 | 0.98
Dataset B | F1-score | 0.93 | 0.89 | 0.97 | 0.99
Dataset C | Precision | 0.94 | 0.87 | 0.89 | 0.98
Dataset C | Recall | 0.97 | 0.93 | 0.91 | 0.96
Dataset C | F1-score | 0.96 | 0.92 | 0.94 | 0.97

There are 737 records in dataset A and dataset B. We employed the first 80% of them as the training set; having trained the learning methods, we used the remaining 20% as the test set to validate each model. The validation results are shown in Table 5. The results show that the precision, recall and F1-score of each model reach high values, and because the datasets pose a multi-class classification problem, the random forest model achieves the best results.
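The evaluation protocol — a chronological 80/20 split followed by per-class precision, recall and F1 as reported in Table 5 — can be sketched without any ML library (the toy labels below are illustrative, not from the datasets):

```python
def chronological_split(records, train_frac=0.8):
    """First 80% of the time-ordered records for training, rest for test."""
    cut = int(len(records) * train_frac)
    return records[:cut], records[cut:]

def precision_recall_f1(y_true, y_pred, positive):
    """Per-class metrics for one fault label (one-vs-rest)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

train, test = chronological_split(list(range(737)))   # 589 train, 148 test
p, r, f1 = precision_recall_f1(["cpu", "cpu", "io", "normal"],
                               ["cpu", "io", "io", "normal"], "cpu")
```

Splitting chronologically (rather than shuffling) avoids leaking future records into the training set, which matters for time-ordered monitoring data.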

6 Conclusion and Future Work

In this paper, we describe an approach to deploying the NFV application Clearwater through the Kubernetes platform. On this basis, we use the disturbance system and the monitoring system to collect performance data of IaaS


layer devices under an NFV application scenario and build an anomaly database. Three categories of anomaly datasets with specified labels are collected: workload with performance data, faultload with performance data and SLA level with performance data. The details of the anomaly database can be accessed on our website12. Through some widely used machine learning algorithms, we verify these datasets and obtain high accuracy, which means these datasets have reference value for anomaly detection. In the future, we will try more anomaly scenes and causes, and build the corresponding anomaly datasets to analyze them. We hope this work will provide guidance for the detection of anomalies in different scenes.

References

1. Liu, J., Jiang, Z., Kato, N., Akashi, O., Takahara, A.: Reliability evaluation for NFV deployment of future mobile broadband networks. IEEE Wirel. Commun. 23(3), 90–96 (2016)
2. Pieters, M., Wiering, M.: Comparison of machine learning techniques for multi-label genre classification. In: Verheij, B., Wiering, M. (eds.) BNAIC 2017. CCIS, vol. 823, pp. 131–144. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76892-2_10
3. Sauvanaud, C., Lazri, K., Kaâniche, M., Kanoun, K.: Anomaly detection and root cause localization in virtual network functions. In: 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 196–206. IEEE (2016)
4. Bhardwaj, S., Jain, L., Jain, S.: Cloud computing: a study of infrastructure as a service (IAAS). Int. J. Eng. Inf. Technol. 2(1), 60–63 (2010)
5. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, pp. 1–6. IEEE (2009)
6. Thill, M., Konen, W., Bäck, T.: Online anomaly detection on the webscope S5 dataset: a comparative study. In: 2017 Evolving and Adaptive Intelligent Systems (EAIS), pp. 1–8, May 2017
7. Natella, R., Cotroneo, D., Madeira, H.S.: Assessing dependability with software fault injection: a survey. ACM Comput. Surv. (CSUR) 48(3), 44 (2016)
8. Delvaux, J., Verbauwhede, I.: Fault injection modeling attacks on 65 nm arbiter and RO sum PUFs via environmental changes. IEEE Trans. Circuits Syst. I: Regular Papers 61(6), 1701–1713 (2014)
9. King, C.: Stress-ng (2018)
10. Wang, Y., Li, X.: Achieve high availability about point-single failures in openstack. In: 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 01, pp. 45–48, December 2015
11. Aversa, R., Panza, N., Tasquier, L.: An agent-based platform for cloud applications performance monitoring. In: 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, pp. 535–540, July 2015

12. https://github.com/XLab-Tongji/ADNFVI


12. Buczak, A.L., Guven, E.: A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 18(2), 1153–1176 (2016)
13. Iglesias, F., Zseby, T.: Analysis of network traffic features for anomaly detection. Mach. Learn. 101(1–3), 59–84 (2015)
14. Kulkarni, A., Pino, Y., French, M., Mohsenin, T.: Real-time anomaly detection framework for many-core router through machine-learning techniques. ACM J. Emerg. Technol. Comput. Syst. (JETC) 13(1), 10 (2016)
15. Erfani, S.M., Rajasegarar, S., Karunasekera, S., Leckie, C.: High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recogn. 58, 121–134 (2016)

An Axiomatization for BSP Algorithms

Yoann Marquer and Frédéric Gava(B)

Laboratory of Algorithms, Complexity and Logic (LACL), University of Paris-East, Créteil, France
[email protected], [email protected]

Abstract. Gurevich's thesis stipulates that sequential abstract state machines (asms) capture the essence of sequential algorithms. On the other hand, the bulk-synchronous parallel (bsp) bridging model is a well-known model for hpc algorithm design. It provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. The assumptions of the bsp model thus provide portable and scalable performance predictions on most hpc systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture bsp algorithms.

Keywords: bsp · asm · Parallel algorithm · hpc · Postulates · Cost model

1 Introduction

1.1 Context of the Work

Introduction Context of the Work

Nowadays, hpc (high performance computing) is the norm in many areas, but it remains more difficult to have well-defined paradigms and a common vocabulary as is the case in the traditional sequential world. The problem arises from the difficulty of getting a taxonomy of computer architectures and frameworks: there is a zoo of definitions of systems, languages, paradigms and programming models. Indeed, in the hpc community, several terms can be used to designate the same thing, so that misunderstandings are easy. We can cite parallel patterns [5] versus algorithmic skeletons [8]; shared memory (pram) versus thread concurrency and direct remote access (drma); asynchronous send/receive routines (mpi, http://mpi-forum.org/) versus communicating processes (π-calculus). In the sequential world, it is easier to classify programming languages within their paradigm (functional, object oriented, etc.) or by using some properties of the compilers (statically or dynamically typed, abstract machine or native code execution). This is mainly due to the fact that there is an overall consensus on what sequential computing is. For these languages, formal semantics have often been studied and there are now many tools for testing, debugging, cost analysis, software engineering, etc. In this way, programmers can implement sequential algorithms using these languages, which properly characterize the sequential algorithms.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 72–88, 2018. https://doi.org/10.1007/978-3-030-05057-3_6


This consensus is only fair because everyone informally agrees on what constitutes a sequential algorithm. And now, half a century later, there is a growing interest in defining formally the notion of algorithms [10]. Gurevich introduced an axiomatic presentation (largely machine independent) of the sequential algorithms in [10]. The main idea is that there is no language that truly represents all sequential algorithms. In fact, every algorithmic book presents algorithms in its own way and programming languages give too much detail. The axiomatic definition [10] of the algorithms has been mapped to the notion of abstract state machine (asm, a kind of Turing machine with the appropriate level of abstraction): every sequential algorithm can be captured by an asm. This allows a common vocabulary about sequential algorithms, which has been studied by the asm community for several years. A parallel computer, or multi-processor system, is a computer composed of more than one processor (or unit of computation). It is common to classify parallel computers (Flynn's taxonomy) by the way they access the system memory (shared or distributed). Indeed, the memory access scheme influences heavily the programming method of a given system. Distributed memory systems are needed for computations using a large amount of data which does not fit in the memory of a single machine. The three postulates for sequential algorithms are mainly consensual. Nevertheless, to our knowledge, there is no such work for hpc frameworks: first, due to the zoo of (informal) definitions, and second, due to a lack of realistic cost models of common hpc architectures. In hpc, the cost measurement is not based on the complexity of an algorithm but rather on the execution time, measured using empirical benchmarks. Programmers benchmark load balancing, communication (size of data), etc.
Using such techniques, it is very difficult to explain why one code is faster than another and which one is more suitable for a given architecture. This is regrettable because the community is failing to obtain a rigorous characterization of sub-classes of hpc algorithms. There is also a lack of study of the algorithmic completeness of hpc languages, which is the basis from which to specify what can or cannot be effectively programmed. Finally, taking into account all the features of all hpc paradigms is a daunting task that is unlikely to be achieved [9]. Instead, a bottom-up strategy (from the simplest models to the most complex) may be a solution that could serve as a basis for more general hpc models.

1.2 Content of the Work

Using a bridging model [20] is a first step towards this solution because it simplifies the tasks of algorithm design and programming, simplifies reasoning about costs, and ensures better portability from one system to another. A bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. We deliberately limit our work to the bulk-synchronous parallel (bsp) bridging model [1,18] because it has the advantage of being endowed with a simple model of execution. We leave more complex models


to future work. Moreover, there are many different libraries and languages for programming bsp algorithms, for example the bsplib for c [11] or java [17], bsml [?], pregel [12] for big data, etc. Concurrent asms [3] try to capture the more general definition of asynchronous and distributed computations. We promote a rather different "bottom-up" approach, consisting of restricting the model under consideration, so as to better highlight the algorithm execution time (which is often too difficult to assess for general models) and, more generally, to formalize the algorithms of a bridging model at their natural level of abstraction, instead of using a more general model and then restricting it with arbitrary hypotheses. As a basis for this work, we first give an axiomatic definition of bsp algorithms (algoBSP) with only 4 postulates. Then we extend the asm model [10] of computation (asmBSP) for bsp. Our goal is to define a convincing set of parallel algorithms running in a predictable time and to construct a model that computes these algorithms only. This can be summarized by algoBSP = asmBSP. An interesting and novel point of this work is that the bsp cost model is preserved.

1.3 Outline

Many definitions used here are well known to the asm community. Recalling all of them would be too long, but they are available in the online technical report [22]. The remainder of this paper is structured as follows. In Sect. 2 we first recall the bsp model and define its postulates. Secondly, in Sect. 3, we give the operational semantics of asmBSP and, finally, we give the main result. Section 4 concludes, gives some related work and a brief outlook on future work.

2 Characterizing BSP Algorithms

2.1 The BSP Bridging Model of Computation

As the ram model provides a unifying approach that can bridge the worlds of sequential hardware and software, so Valiant sought [20] a unifying model that could provide an effective (and universal) bridge between parallel hardware and software. A bridging model [20] reduces the gap between an abstract execution (programming an algorithm) and concrete parallel systems (using a compiler and designing/optimizing a physical architecture). The direct mode bsp model [1,18] is a bridging model that simplifies the programming of various parallel architectures using a certain level of abstraction. The assumptions of the bsp model are to provide portable and scalable performance predictions on hpc systems. Without dealing with low-level details of hpc architectures, the programmer can thus focus on algorithm design only. The bsp bridging model describes a parallel architecture, an execution model for the algorithms, and a cost model which allows to predict their performances on a given bsp architecture.


A bsp computer can be specified by p uniform computing units (processors), each capable of performing one elementary operation or accessing a local memory in one time unit. Processors communicate by sending a data to every other processor in g time units (gap, which reflects network bandwidth inefficiency), and a barrier mechanism is able to synchronise all the processors in L time units ("latency", the ability of the network to deliver messages under a continuous load). Such values, along with the processor's speed (e.g. Mflops), can be empirically determined by executing benchmarks. The time g is thus the time for collectively delivering a 1-relation, which is a collective exchange where every processor receives/sends at most one word. The network can deliver an h-relation in time g×h. A bsp computation is organized as a sequence of supersteps (see Fig. 1). During a superstep, the processors may perform computations on local data or send messages to other processors. Messages are available for processing at their destinations by the next superstep, and each superstep is ended with the barrier synchronisation of the processors.

Fig. 1. A bsp super-step (processors p0...p3: local computations, then communication, then a barrier before the next super-step).

The execution time (cost) of a super-step s is the sum of the maximum of the local processing times, the data delivery time and the global synchronisation time. It is expressed by the following formula: Cost(s) = ws + hs×g + L, where ws = max0≤i<p wis is the maximum local computation time on any processor during s, and hs = max0≤i<p his, with his the number of words sent or received by processor i during s. The total cost of a bsp algorithm is the sum of the costs of its super-steps.
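The cost formula can be checked with a small sketch (the processor counts and timings below are illustrative):

```python
def superstep_cost(w, h, g, L):
    """Cost(s) = ws + hs*g + L, where ws and hs are the maxima over the
    p processors of local work and of words sent/received."""
    return max(w) + max(h) * g + L

def bsp_cost(supersteps, g, L):
    # total cost = sum of the super-step costs
    return sum(superstep_cost(w, h, g, L) for w, h in supersteps)

# p = 4 processors, two super-steps (illustrative numbers):
cost = bsp_cost([([10, 12, 9, 11], [3, 2, 4, 1]),
                 ([5, 5, 6, 4],    [2, 2, 2, 2])], g=2, L=50)
```

Note how the max over processors models the barrier: every processor waits for the slowest one before the next super-step starts.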

2.2 Axiomatic Characterization of BSP Algorithms

Postulate 1 (Sequential Time). A bsp algorithm A is given by:
1. A set of states S(A);
2. A set of initial states I(A) ⊆ S(A);
3. A transition function τA : S(A) → S(A).

We follow [10] in which states, as first-order structures, are full instantaneous descriptions of an algorithm.

Definition 1 (Structure). A (first-order) structure X is given by:
1. A (potentially infinite) set U(X) called the universe (or domain) of X
2. A finite set of function symbols L(X) called the signature (language) of X
3. For every symbol s ∈ L(X) an interpretation s^X such that:
(a) If c has arity 0 then c^X is an element of U(X)
(b) If f has an arity α > 0 then f^X is an application: U(X)^α → U(X)
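Postulate 1 can be made concrete with a toy sketch: an algorithm is given by its states, initial states and transition function, an execution iterates τA, and a state is final when τA(St) = St. The counting example below is purely illustrative, not from the paper:

```python
def execute(tau, initial_state, max_steps=1000):
    """Iterate the transition function until a final state is reached,
    i.e. tau(S) == S (Postulate 1; bounded here for safety)."""
    state = initial_state
    for _ in range(max_steps):
        next_state = tau(state)
        if next_state == state:   # final: the execution repeats St forever
            return state
        state = next_state
    raise RuntimeError("no final state reached")

# Toy bsp-flavoured example: a state is the p-tuple of local memories
# (here p = 3 integers), and each step increments each one up to 5.
tau = lambda s: tuple(min(x + 1, 5) for x in s)
final = execute(tau, (0, 1, 2))
```

An execution starting from a final state is immediately terminal, matching the definition S0, S1, ..., St, St, ... given below.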

76

Y. Marquer and F. Gava

In order to have a uniform presentation [10], we consider constant symbols in L(X) as 0-ary function symbols, and relation symbols R as their indicator function χR. Therefore, every symbol in L(X) is a function. Moreover, partial functions can be implemented with a special symbol undef, and we assume in this paper that every L(X) contains the booleans (¬, ∧) and the equality. We also distinguish dynamic symbols, whose interpretation may change from one state to another, from static symbols, which are the elementary operations.

Definition 2 (Term). A term of L(X) is defined by induction:
1. If c has arity 0, then c is a term;
2. If f has arity α > 0 and θ1, …, θα are terms, then f(θ1, …, θα) is a term.

The interpretation θ^X of a term θ in a structure X is defined by induction on θ:
1. If θ = c is a constant symbol, then θ^X =def c^X;
2. If θ = f(θ1, …, θα), where f is a symbol of the language L(X) with arity α > 0 and θ1, …, θα are terms, then θ^X =def f^X(θ1^X, …, θα^X).

A formula F is a term of the particular form true | false | R(θ1, …, θα) | ¬F | (F1 ∧ F2), where R is a relation symbol (i.e. a function with output true^X or false^X) and θ1, …, θα are terms. We say that a formula F is true (resp. false) in X if F^X = true^X (resp. false^X).

A bsp algorithm works on independent and uniform computing units. Therefore, a state St of the algorithm A must be a tuple (X_t^1, …, X_t^p). To simplify, we annotate tuples from 1 to p and not from 0 to p − 1. Notice that p is not fixed for the algorithm, so A can have states using different sizes of "p-tuples" (informally, p is the number of processors). In this paper, we simply consider that this number is preserved during a particular execution. In other words, the size of the p-tuples is fixed for an execution by the initial state of A for that execution.

If (X^1, …, X^p) is a state of the algorithm A, then the structures X^1, …, X^p will be called processors or local memories. The set of the independent local memories of A will be denoted by M(A).

We now define the bsp algorithms as the objects verifying the four postulates presented here. The computation for every processor is done in parallel and step by step. An execution of A is a sequence of states S0, S1, S2, … such that S0 is an initial state and, for every t ∈ N, St+1 = τA(St). Instead of defining a set of final states for the algorithms, we will say that a state St of an execution is final if τA(St) = St, that is, the execution is S0, S1, …, St−1, St, St, … We say that an execution is terminal if it contains a final state. We are interested in the algorithm and not in a particular implementation (e.g., the variables' names); therefore, in the postulates we consider the states up to multi-isomorphism.

Definition 3 (Multi-isomorphism). ζ⃗ is a multi-isomorphism between two states (X^1, …, X^p) and (Y^1, …, Y^q) if p = q and ζ⃗ is a p-tuple of applications

An Axiomatization for BSP Algorithms

77

(ζ1, …, ζp) such that, for every 1 ≤ i ≤ p, ζi is an isomorphism between X^i and Y^i.

Postulate 2 (Abstract States). For every bsp algorithm A:
1. The states of A are p-tuples of structures with the same finite signature L(A);
2. S(A) and I(A) are closed by multi-isomorphism;
3. The transition function τA preserves p and the universes, and commutes with multi-isomorphisms.

For a bsp algorithm A, let X be a local memory of A, f ∈ L(A) be a dynamic α-ary function symbol, and a1, …, aα, b be elements of the universe U(X). We say that (f, a1, …, aα) is a location of X, and that (f, a1, …, aα, b) is an update on X at the location (f, a1, …, aα). For example, if x is a variable then (x, 42) is an update at the location x. But symbols with arity α > 0 can be updated too. For example, if f is a one-dimensional array, then (f, 0, 42) is an update at the location (f, 0).

If u is an update, then X ⊕ u is a new structure of signature L(A) and universe U(X) such that the interpretation of a function symbol f ∈ L(A) is:

f^{X⊕u}(a⃗) =def b if u = (f, a⃗, b), and f^{X⊕u}(a⃗) =def f^X(a⃗) otherwise,

where we write a⃗ = a1, …, aα. For example, in X ⊕ (f, 0, 42), every symbol has the same interpretation as in X, except maybe for f, because f^{X⊕(f,0,42)}(0) = 42 and f^{X⊕(f,0,42)}(a) = f^X(a) otherwise. We write "maybe" because it may be the case that f^X(0) is already 42.

If f^X(a⃗) = b, then the update (f, a⃗, b) is said to be trivial in X, because nothing has changed. Indeed, if (f, a⃗, b) is trivial in X then X ⊕ (f, a⃗, b) = X.

If Δ is a set of updates, then Δ is consistent if it does not contain two distinct updates at the same location. Notice that if Δ is inconsistent, then there exist (f, a⃗, b), (f, a⃗, b′) ∈ Δ with b ≠ b′ and, in that case, the entire set of updates clashes:

f^{X⊕Δ}(a⃗) =def b if (f, a⃗, b) ∈ Δ and Δ is consistent, and f^{X⊕Δ}(a⃗) =def f^X(a⃗) otherwise.

If X and Y are two local memories of the same algorithm A, then there exists a unique consistent set Δ = {(f, a⃗, b) | f^Y(a⃗) = b and f^X(a⃗) ≠ b} of non-trivial updates such that Y = X ⊕ Δ. This Δ is called the difference between the two local memories and is denoted by Y ⊖ X.

Let X⃗ = (X^1, …, X^p) be a state of A. According to the transition function τA, the next state is τA(X⃗), which will be denoted by (τA(X⃗)^1, …, τA(X⃗)^p). We denote by Δi(A, X⃗) =def τA(X⃗)^i ⊖ X^i the set of updates done by the i-th processor of A on the state X⃗, and by Δ(A, X⃗) =def (Δ1(A, X⃗), …, Δp(A, X⃗))


the "multiset" of updates done by A on the state X⃗. In particular, if a state X⃗ is final, then τA(X⃗) = X⃗, so Δ(A, X⃗) = ∅⃗.

Let A be a bsp algorithm and T be a set of terms of L(A). We say that two states (X^1, …, X^p) and (Y^1, …, Y^q) of A coincide over T if p = q and, for every 1 ≤ i ≤ p and every t ∈ T, we have t^{X^i} = t^{Y^i}.
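A minimal sketch of updates and the ⊕ operation (our own illustration; encoding locations as tuples is an assumption of this sketch, not the paper's formalism):

```python
def apply_updates(memory, updates):
    """Compute X ⊕ Δ for a set of updates Δ, checking consistency.

    memory: dict mapping locations (f, a1, ..., an) to values b
    updates: iterable of tuples (f, a1, ..., an, b)
    """
    locations = [u[:-1] for u in updates]
    if len(locations) != len(set(locations)):
        raise ValueError("inconsistent update set: two updates share a location")
    new_memory = dict(memory)
    for *location, b in updates:
        new_memory[tuple(location)] = b
    return new_memory

X = {("x",): 0, ("f", 0): 7}
Y = apply_updates(X, [("x", 42), ("f", 0, 7)])   # (f, 0, 7) is trivial here
print(Y[("x",)])  # 42
```

The difference Y ⊖ X would then be the set of non-trivial updates turning X into Y, here just {(x, 42)}.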

Postulate 3 (Bounded Exploration for Processors). For every bsp algorithm A there exists a finite set T(A) of terms such that, for every pair of states X⃗ and Y⃗, if they coincide over T(A) then Δ(A, X⃗) = Δ(A, Y⃗), i.e. for every 1 ≤ i ≤ p we have Δi(A, X⃗) = Δi(A, Y⃗).

T(A) is called the exploration witness [10] of A. If a set of terms T is finite, then its closure by subterms is finite too. We assume that T(A) is closed by subterms and that the symbol true is always in the exploration witness [10]. The interpretations of the terms in T(A) are called the critical elements, and we prove in [22] that every value in an update is a critical element:

Lemma 1 (Critical Elements). For every state (X^1, …, X^p) of A and every 1 ≤ i ≤ p, if (f, a⃗, b) ∈ Δi(A, X⃗) then a⃗ and b are interpretations in X^i of terms in T(A).

This implies that, at every step of the computation, a given processor reads or writes only a bounded number of terms (a bounded amount of work):

Lemma 2 (Bounded Set of Updates). For every state (X^1, …, X^p) of the algorithm A and every 1 ≤ i ≤ p, |Δi(A, X⃗)| is bounded.

Notice that for the moment we make no assumption on the communication between processors. Moreover, these three postulates are a "natural" extension of the ones of [10]. By "natural", we mean that if we assume p = 1 then our postulates are exactly the same:

Lemma 3 (A Single Processor is Sequential). A bsp algorithm with a unique processor (p = 1) is a sequential algorithm. Therefore algo_seq ⊆ algo_BSP.

We now organize the sequence of states into supersteps. The communication between local memories occurs only during a communication phase. To this end, a bsp algorithm A uses two functions, compA and commA, indicating whether A runs computations or communications.

Postulate 4 (Superstep Phases). For every bsp algorithm A there exist two applications compA : M(A) → M(A) commuting with isomorphisms, and commA : S(A) → S(A), such that for every state (X^1, …, X^p):

τA(X^1, …, X^p) = (compA(X^1), …, compA(X^p)) if there exists 1 ≤ i ≤ p such that compA(X^i) ≠ X^i, and τA(X^1, …, X^p) = commA(X^1, …, X^p) otherwise.
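The alternation between phases in Postulate 4 can be sketched as follows (our own illustration; comp and comm stand for the abstract compA and commA, and local memories are modelled as plain values):

```python
def tau(state, comp, comm):
    """One transition of a BSP algorithm (sketch of Postulate 4).

    state: tuple of local memories; comp maps one memory to the next,
    comm maps a whole state to the next (communication phase).
    """
    if any(comp(X) != X for X in state):       # computation phase:
        return tuple(comp(X) for X in state)   # all processors step in parallel
    return comm(state)                         # otherwise: communication phase

# Toy run: each "memory" is a counter that comp advances up to 3,
# and comm is the identity, so the state with all counters at 3 is final.
comp = lambda x: min(x + 1, 3)
comm = lambda s: s
print(tau((0, 3), comp, comm))  # (1, 3)
print(tau((3, 3), comp, comm))  # (3, 3): a final state
```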


A bsp algorithm is an object verifying these four postulates, and we denote by algo_BSP the set of the bsp algorithms. A state (X^1, …, X^p) is said to be in a computation phase if there exists 1 ≤ i ≤ p such that compA(X^i) ≠ X^i. Otherwise, the state is said to be in a communication phase.

This calls for some remarks. First, at every computation step, every processor which has not terminated performs its local computations. Second, we do not specify the function commA, in order to be generic about which bsp library is used. We discuss in Sect. 3.3 the difference between commA and the usual communication routines in the bsp community.

Remember that a state X⃗ is said to be final if τA(X⃗) = X⃗. Therefore, according to the fourth postulate, X⃗ must be in a communication phase, which acts like a final phase terminating the whole execution, as found in mpi. We prove that the bsp algorithms satisfy, during a computation phase, that every processor computes independently of the state of the other processors:

Lemma 4 (No Communication During Computation Phases). For every pair of states (X^1, …, X^p) and (Y^1, …, Y^q) in a computation phase, if X^i and Y^j have the same critical elements then Δi(A, X⃗) = Δj(A, Y⃗).

2.3 Questions and Answers

Why not use a bsp Turing machine to define an algorithm? It is known that standard Turing machines can simulate every algorithm. But we are interested here in the step-by-step behaviour of the algorithms, not in the input-output relation of the computed functions. In this sense, there is no literal identity between the axiomatic point of view (postulates) on algorithms and the operational point of view of Turing machines. Moreover, simulating algorithms with a Turing machine is a low-level approach which does not describe the algorithm at its natural level of abstraction. Every algorithm assumes elementary operations which are not refined down to the assembly language by the algorithm itself. These operations are seen as oracular, which means that they produce the desired output in one step of computation.

But I think there is too much abstraction: when using bsplib, messages received in the past superstep are dropped. Your function commA does not show this fact. We want to be as general as possible. Perhaps a future library will allow reading data received n supersteps ago, as in the BSP+ model of [19]. Moreover, the communication function may perform some computations and is thus not a pure transmission of data. But the exploration witness forbids doing just anything: only a finite set of symbols can be updated. And we provide a realistic example of such a function, which mainly corresponds to the bsplib primitives [22].

And why is it not just a permutation of the values to be exchanged? The communications can be used to model synchronous interactions with the environment (input/output, error messages, etc.) and can therefore make values appear or disappear.


When using bsplib and other bsp libraries, I can switch between sequential computations and bsp ones. Why not model this kind of feature? The sequential parts can be modeled as purely asynchronous computations replicated and performed by all the processors. Alternatively, one processor (typically the first one) performs these computations while the other processors are "waiting" with an empty computation phase.

In [2,3,15,16], the authors give more general postulates about concurrent and/or distributed algorithms. Why not use their work, adding some restrictions to take into account the bsp model of execution? That is another solution. But we think that restricting "more complex" postulates is not a natural characterization of the bsp algorithms. It is better for a model to be expressed at its natural level of abstraction in order to highlight its own properties. For example, there is the problem of the cost model, which is inherent to a bridging model like bsp: it is not clear how such restrictions could highlight the cost model.

Fine. But are you sure about your postulates? I mean, do they define the bsp algorithms completely (and nothing more)? It is impossible to be sure, because we are formalizing a concept that is currently only intuitive. But as the postulates are general and simple, we believe that they correctly capture this intuitive idea. We prove in the next section that a natural operational model for bsp characterizes exactly these postulates.

Wouldn't that be too abstract? The bsp model is supposed to be a bridging model. We treat algorithms at their natural level of abstraction, and not as something to refine down to machines: we explicitly assume that our primitives may not be elementary for a typical modern architecture (but could be so in the future) and that they can achieve a potentially complex operation in one step. This makes it possible to abstract away from a particular hardware model and to compute costs in time (and in space) in a framework where what is considered elementary may vary. For example, in a Euclidean algorithm, either the Euclidean division or the subtraction is the elementary operation. If your bsp algorithm uses elementary operations which cannot be realized on the bsp machine considered, then you are simply not at the right level of abstraction. Our work remains valid for any level of abstraction.

3 BSP-ASM Captures the BSP Algorithms

The four previous postulates define the bsp algorithms from an axiomatic viewpoint, but that does not mean that they have a model or, in other words, that they are defined from an operational point of view. In the same way that the asm model of computation captures the set of sequential algorithms [10], we prove in this section that the asmBSP model captures the bsp algorithms.

3.1 Definition and Operational Semantics of ASM-BSP

Definition 4 (ASM Program [10]).

Π =def f(t1, …, tα) := t0 | if F then Π1 else Π2 endif | par Π1 … Πn endpar

where f has arity α, F is a formula, and t0, t1, …, tα are terms of L(X).

Notice that if n = 0 then par Π1 … Πn endpar is the empty program. If, in if F then Π1 else Π2 endif, the program Π2 is empty, we simply write if F then Π1 endif. An asm machine [10] is thus a kind of Turing machine using not a tape but an abstract structure X.

Definition 5 (ASM Operational Semantics).

Δ(f(θ1, …, θα) := θ0, X) =def {(f, θ1^X, …, θα^X, θ0^X)}
Δ(if F then Π1 else Π2 endif, X) =def Δ(Πi, X), where i = 1 if F is true in X and i = 2 otherwise
Δ(par Π1 … Πn endpar, X) =def Δ(Π1, X) ∪ · · · ∪ Δ(Πn, X)

Notice that the semantics of par is a set of updates done simultaneously, which differs from the usual imperative framework.

A state of an asmBSP machine is a p-tuple of memories (X^1, …, X^p). We assume that asmBSP programs are spmd (single program, multiple data), which means that at each step of computation the asmBSP program Π is executed individually on each processor. Therefore Π induces a multiset of updates and a transition function τΠ:

Δ(Π, (X^1, …, X^p)) =def (Δ(Π, X^1), …, Δ(Π, X^p))
τΠ(X^1, …, X^p) =def (X^1 ⊕ Δ(Π, X^1), …, X^p ⊕ Δ(Π, X^p))

If τΠ(X⃗) = X⃗, then every processor has finished its computation steps. In that case we assume that there exists a communication function to ensure the communication between processors.

Definition 6. An asmBSP machine M is a triplet (S(M), I(M), τM) such that:
1. S(M) is a set of tuples of structures with the same finite signature L(M); S(M) and I(M) ⊆ S(M) are closed by multi-isomorphism;
2. τM : S(M) → S(M) verifies that there exist a program Π and an application commM : S(M) → S(M) such that: τM(X⃗) = τΠ(X⃗) if τΠ(X⃗) ≠ X⃗, and τM(X⃗) = commM(X⃗) otherwise;


3. commM verifies that:
(1) for every state X⃗ such that τΠ(X⃗) = X⃗, commM preserves the universes and the number of processors, and commutes with multi-isomorphisms;
(2) there exists a finite set of terms T(commM) such that, for every pair of states X⃗ and Y⃗ with τΠ(X⃗) = X⃗ and τΠ(Y⃗) = Y⃗, if they coincide over T(commM) then Δ(M, X⃗) = Δ(M, Y⃗).

We denote by asmBSP the set of such machines. As before, a state X⃗ is said to be final if τM(X⃗) = X⃗. So if X⃗ is final then τΠ(X⃗) = X⃗ and commM(X⃗) = X⃗.

The last conditions on the communication function may seem arbitrary, but they are required to ensure that the communication function is not a kind of magic device. For example, without these conditions, we could imagine that commM computes the output of the algorithm in one step, or solves the halting problem. Moreover, we construct an example of commM in [22] (Section D).

3.2 The BSP-ASM Thesis

We prove that asmBSP captures the computation phases of the bsp algorithms in three steps. First, we prove that during an execution, each set of updates is the interpretation of an asm program (Lemma 8, p. 16 of [22]). Then, we prove an equivalence between this potentially infinite number of programs (Lemma 9, p. 17). Finally, by using the third postulate, we prove in Lemma 10 (p. 18) that there is only a bounded number of relevant programs, which can be merged into a single one.

Proposition 1 (BSP-ASMs Capture Computations of BSP Algorithms). For every bsp algorithm A, there exists an asm program ΠA such that for every state X⃗ in a computation phase: Δ(ΠA, X⃗) = Δ(A, X⃗).

Theorem 1. algo_BSP = asm_BSP. (The proof is available in [22], Section C, p. 20.)

3.3 Cost Model Property and the Function of Communication

There are two more steps needed in order to claim that asmBSP objects are the algorithms of the bsp bridging model: (1) ensuring that the duration corresponds to the standard cost model; and (2) solving issues about the communication function.

Cost Model. If the execution begins with a communication, we assume that no computation is done during the first superstep. Recall that a state X⃗t is in a computation phase if there exists 1 ≤ i ≤ p such that compA(X_t^i) ≠ X_t^i. The computation for every processor is done in parallel, step by step. So the cost in time of the computation phase is w = max_{1≤i≤p}(wi), where wi is the number of steps done by processor i (on memory X^i) during the superstep. Then the state is in a communication phase, when the messages between the processors are sent and received. Notice that commA may require several


steps in order to communicate the messages, which contrasts with the usual approach in bsp, where the communication actions of a superstep are considered as one unit. That approach would violate the third postulate, so we had to consider a step-by-step communication, and then regard these actions as one communication phase. asmBSP exchanges terms, and we show in [22] how to formally define the size of terms. But one can imagine a machine that must further decompose the terms in order to transmit them (into bits, for example). We simply assume that the data are communicable in time g for a 1-relation. So, during a superstep, the communication phase requires h × g steps. It remains to add the cost of the synchronization of the processors, which is assumed in the usual bsp model to be a parameter L. Therefore, we obtain a cost property which is sound with respect to the standard bsp cost model.

A Realization of the Communication. An example of a communication function for the standard bsplib primitives bsp_get, bsp_put, bsp_send and bsp_move is presented in [22] (Section D).

Proposition 2 (Communication). A communication function, with routines for distant readings/writings and point-to-point sends, performing an h-relation and requiring at most h exchanges, can be designed using asm.

One may argue that the last postulate allows the communication function to do computations. To avoid this, we assume that the terms in the exploration witness T(M) can be separated between T(Π) and T(commM), such that T(Π) is for the states in a computation phase, and that for every update (f, a⃗, b) of a processor X^i in a communication phase, either there exists a term t ∈ T(commM) such that b = t^{X^i}, or there exists a variable v ∈ T(Π) and a processor X^j such that b = v^{X^j} (representation presented in Section D, p. 24). To do a computation, a term like x + 1 is required, so the restriction to a variable prevents the computation of the terms in T(Π). Of course, the last communication step should be able to write in T(Π), and the final result should be read in T(Π).
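For illustration, the delivery of an h-relation can be sketched as follows (our own code, not the paper's construction; it routes at most one word per source/destination pair and measures the realized h, the quantity that bounds the cost h × g):

```python
def deliver_h_relation(outgoing):
    """Deliver messages between p processors and measure the realized h.

    outgoing[src] is a dict {dst: word}; returns (incoming, h), where
    incoming[dst] is {src: word} and h is the max fan-in/fan-out
    over all processors.
    """
    p = len(outgoing)
    incoming = [dict() for _ in range(p)]
    for src, messages in enumerate(outgoing):
        for dst, word in messages.items():
            incoming[dst][src] = word
    h = max([len(m) for m in outgoing] + [len(m) for m in incoming])
    return incoming, h

incoming, h = deliver_h_relation([{1: "a", 2: "b"}, {0: "c"}, {}])
print(h)  # 2: processor 0 sends two words
```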

4 Conclusion and Future Work

4.1 Summary of the Contribution

A bridging model provides a common level of understanding between hardware and software engineers. It provides software developers with an attractive escape route from the world of architecture-dependent parallel software [20]. The bsp bridging model allows the design of "immortal" (efficient and portable) parallel algorithms using a realistic cost model (and without any over-specification requiring a large number of parameters) that can fit most distributed architectures. It has been used with success in many domains [1].

We have given an axiomatic definition of bsp algorithms by adding only one postulate to the ones for sequential algorithms [10], which have been


widely accepted by the scientific community. Mainly, this postulate is the call to a communication function. We abstract how communication is performed, without restricting ourselves to a specific bsp library. We thus answer previous criticisms by defining a convincing set of parallel algorithms running in a predictable time. Our work is relevant because it allows universality ("immortal" stands for bsp computing): all future bsp algorithms, whatever their specificities, will be captured by our definitions. So asmBSP is not just another model; it is a class model, which contains all bsp algorithms. This small addition allows greater confidence in the formal definition compared to previous work: the postulates of concurrent asms do not provide the same level of intuitive clarity as the postulates for sequential algorithms. Our work is limited to bsp algorithms, but that is still sufficient for many hpc and big-data applications.

We have thus revisited the problem of the "parallel ASM thesis", i.e., to provide a machine-independent definition of bsp algorithms and a proof that these algorithms are faithfully captured by asmBSP. We also prove that the cost model is preserved, which is the main novelty and specificity of this work compared to the traditional work on distributed or concurrent asms.

4.2 Questions and Answers About this Work

Why do you use a new model of computation, asmBSP, instead of asms only? Indeed, each processor can be seen as a sequential asm. So, in order to simulate one step of a bsp algorithm using several processors, we could use pids to compute sequentially the next step for each processor by using an asm. Even if such a simulation exists between the two models, what you describe, a "sequentialization" (each processor, one after the other) of the bsp model of execution, cannot be exactly the transition function of the postulates. Moreover, in order to stay bounded, having p exploration witnesses (one for each sequential asm) forces p to be a constant of the algorithm. In our work, p is only fixed per execution, which makes the approach more general when modeling algorithms.

Is another model possible to characterize the bsp algorithms? Sure. Another model may even be more convenient for proving certain properties. But it would define the same set, just described in another way.

So, reading the work of [3], a distributed machine is defined as a set of pairs (a, Πa), where a is the name of the machine and Πa a sequential asm. Reading your definition, I see only one Π and not "p" processors as in the bsp model. I thus cannot recognize a bsp computer in it. You are absolutely right, but we do not model a bsp computer; our work is about bsp algorithms. The asmBSP program contains the algorithm which is used on each "processor" (a first-order structure, as explained before). It is the postulates (axiomatic point of view) that characterize the class of bsp algorithms, rather than a set of abstract machines (operational point of view). That is closer to the original approach [10]. We also want to point out that, unlike [3], we are not


limited to a finite (fixed) set of machines: in our model, an algorithm is defined for p = 1, 2, 1000, etc. And we are not limited to point-to-point communications.

Ok, but with only a single code, you cannot have all the parallel algorithms... We follow [4] on the difference between a PARallel composition of SEQuential actions (PAR of SEQ) and a SEQuential composition of PARallel actions (SEQ of PAR). Our asmBSP is SEQ(PAR). This leads to a macroscopic point of view1 which is close to a specification. Being a SEQ(PAR) model allows a high-level description of the bsp algorithms.

So, why are you limited to spmd computations? Different codes can be run by the processors using conditionals on the "id" of the processors, for example "if pid=0 then code1 else code2" for running "code1" (e.g. the master part) only on processor 0. So, again, we are not limited to spmd computations. The asm program Π fully contains the bsp algorithm, that is, all the "actions" that can be performed by any processor, not necessarily the same instructions: each processor picks the instruction it needs to execute, and these can be completely different. Only the size of Π is finite, due to the exploration witness. For example, it is impossible to have a number of conditionals in Π that depends on p. Indeed, according to Lemma 4, during a computation phase, if two processors coincide over the exploration witness, then they will run the same code. And according to Postulate 3, the exploration witness is bounded. So there exists only a bounded number c of possible subroutines during the computation phase, even if p ≫ c. Notice that processors may not know their own ids and there is no order on p-tuples; we never use such a property: processors are organized like a set, and we use tuples only as a convenient notation to add the bsp execution model to the original postulates of [10].

Ok, but I cannot get the interleavings of the computations as in [3]. Your model seems very synchronous! The bsp model makes the hypothesis that the processors are uniform. So if one processor can perform one step of the algorithm, there is no reason to block it just to exhibit an interleaving. And if there is nothing to do, a processor does nothing until the communication phase. Our execution model is thus largely "asynchronous" during the computation phases.

Speaking of communication, why apply the communication function several times? When designing a bsp algorithm, I use a collective operation once! An asm is like a Turing machine: it is not possible to perform all the communications in a single step, since the exploration witness forbids doing so. Our communication function performs some exchanges until there are none left.

1 Take for example a bsp sorting algorithm: first, all the processors locally sort their own data, and then they perform some exchanges in order to have the elements sorted between them. One defines it as a sequence of parallel actions, independently of the number of processors.


What happens in case of runtime errors during communications? Typically, when one processor performs more supersteps than the others, or when there is an out-of-bounds send or read, a runtime error occurs. The bsp communication function can then return a ⊥ value, which stops the operational semantics of asmBSP.

4.3 Related Work

As far as we know, some work exists to model distributed programs using asms [15], but none convincingly characterizes bsp algorithms. In [6], the authors model the p3l set of skeletons. That allows the analysis of p3l programs using standard asm tools, but not a formal characterization of what p3l is and is not. The first work to extend asms to concurrent, distributed, mobile-agent algorithms is [2]. Too many postulates are used, making the comprehension hard to follow or worse (a loss of confidence). A first attempt to simplify this work was made in [16], and it was simplified again in [7] by the use of multiset comprehension terms to maintain a kind of bounded exploration. The authors then prove that asms capture these postulates. Moreover, we are interested in distributed (hpc) computations more than in parallel (threading) asms.

We want to clarify one thing. The asm thesis comes from the fact that sequential algorithms work in small steps, that is, steps of bounded complexity. But the number of processors (or computing units) is unbounded for parallel algorithms, which motivated the work of [2] to define parallel algorithms with wide steps, that is, steps of unbounded complexity. Hence the technicality of the presentation, and the unconvincing attempts to capture parallel algorithms [3].

Extending the asms to distributed computing is not new [3]. We believe that those postulates are more general than ours, but we think that our extension remains simple and natural for bsp algorithms. Those authors are also not concerned with the problem of axiomatizing classes of algorithms using a cost model, which is the heart of our work and the main advantage of the bsp model.

4.4 Future Work

This work opens many directions. First, how can our work be adapted to a hierarchical extension of bsp [21], which is closer to modern hpc architectures? Second, bsp is a bridging model between hardware and software; it would be interesting to study this link more formally. For example, can we prove that the primitives of a bsp language can truly "be bsp" on a typical cluster architecture? Third, we are currently extending the work of [13] in order to establish the bsp algorithmic completeness of a bsp imperative programming language. There are concrete applications: many languages have a bsp-like model of execution, for example pregel [12] for writing large-graph algorithms. An interesting application is proving which ones are bsp algorithmically complete and which are not. bsplib programs are intuitively bsp; mapreduce is a


good candidate for not being so [14]. Similarly, one can imagine proving which languages are too expressive for bsp; mpi is intuitively one of them. Last, the first author is working on postulates for more general distributed algorithms à la mpi. In any case, studying the bsp-ram (such as the communication-oblivious algorithms of [19]) or mapreduce would lead to defining subclasses of bsp algorithms.

References

1. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004)
2. Blass, A., Gurevich, Y.: Abstract state machines capture parallel algorithms. ACM Trans. Comput. Log. 4(4), 578–651 (2003)
3. Börger, E., Schewe, K.-D.: Concurrent abstract state machines. Acta Inf. 53(5), 469–492 (2016)
4. Bougé, L.: The data parallel programming model: a semantic perspective. In: Perrin, G.-R., Darte, A. (eds.) The Data Parallel Programming Model. LNCS, vol. 1132, pp. 4–26. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61736-1_40
5. Cappello, F., Snir, M.: On communication determinism in HPC applications. In: Computer Communications and Networks (ICCCN), pp. 1–8. IEEE (2010)
6. Cavarra, A., Zavanella, A.: A formal model for the parallel semantics of P3L. In: ACM Symposium on Applied Computing (SAC), pp. 804–812 (2000)
7. Ferrarotti, F., Schewe, K.-D., Tec, L., Wang, Q.: A new thesis concerning synchronised parallel computing: simplified parallel ASM thesis. Theor. Comput. Sci. 649, 25–53 (2016)
8. González-Vélez, H., Leyton, M.: A survey of algorithmic skeleton frameworks. Softw. Pract. Exp. 40(12), 1135–1160 (2010)
9. Gorlatch, S.: Send-receive considered harmful: myths and realities of message passing. ACM TOPLAS 26(1), 47–56 (2004)
10. Gurevich, Y.: Sequential abstract-state machines capture sequential algorithms. ACM Trans. Comput. Log. 1(1), 77–111 (2000)
11. Hill, J.M.D., McColl, B., et al.: BSPLIB: the BSP programming library. Parallel Comput. 24, 1947–1980 (1998)
12. Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: Management of Data, pp. 135–146. ACM (2010)
13. Marquer, Y.: Algorithmic completeness of imperative programming languages. Fundamenta Informaticae, pp. 1–27 (2017, accepted)
14. Pace, M.F.: BSP vs MapReduce. Procedia Comput. Sci. 9, 246–255 (2012)
15. Prinz, A., Sherratt, E.: Distributed ASM: pitfalls and solutions.
In: Ait Ameur, Y., Schewe, K.D. (eds.) ABZ 2014. Lecture Notes in Computer Science, vol. 8477, pp. 210–215. Springer, Heidelberg (2014) 16. Schewe, K.-D., Wang, Q.: A simpliﬁed parallel ASM thesis. In: Derrick, J., et al. (eds.) ABZ 2012. LNCS, vol. 7316, pp. 341–344. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30885-7 27 17. Seo, S., et al.: HAMA: an eﬃcient matrix computation with the MAPREDUCE framework. In: Cloud Computing (CloudCom), pp. 721–726. IEEE (2010) 18. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and answers about BSP. Sci. Program. 6(3), 249–274 (1997)

88

Y. Marquer and F. Gava

19. Tiskin, A.: The design and analysis of bulk-synchronous parallel algorithms. PhD thesis, Oxford University Computing Laboratory (1998)
20. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
21. Valiant, L.G.: A bridging model for multi-core computing. J. Comput. Syst. Sci. 77(1), 154–166 (2011)
22. Marquer, Y., Gava, F.: An ASM thesis for BSP. Technical report (2018). https://hal.archives-ouvertes.fr/hal-01717647

Efficient and Secure Outsourced Linear Regression

Haomiao Yang(B), Weichao He(B), Qixian Zhou(B), and Hongwei Li

School of Computer Science and Engineering and Center for Cyber Security, University of Electronic Science and Technology of China, Chengdu, China
{haomyang,hongweili}@uestc.edu.cn, [email protected], [email protected]

Abstract. Linear regression, as a classical machine learning algorithm, is often used as a predictor. In the era of big data, data owners can outsource their linear regression task and data to a cloud server, which has powerful computation and storage resources. However, outsourcing data may break its privacy. A well-known countermeasure is to encrypt the data prior to uploading it to the cloud using homomorphic encryption (HE). Nevertheless, running the linear regression protocol in the encrypted domain is a difficult problem. With this observation, we propose an efficient and secure linear regression protocol over outsourced encrypted data using vector HE, named ESLR, and within our protocol we further present a privacy-preserving gradient descent method. Security analysis shows that our protocol guarantees the confidentiality of data, and compared to linear regression over plaintexts, our proposal achieves almost the same accuracy and efficiency over ciphertexts.

Keywords: Machine learning · Homomorphic encryption · Linear regression · Gradient descent

1 Introduction

Predictive modeling is an essential tool in decision-making processes in domains such as policy making, medicine, law enforcement, and finance. Consider a hospital that would like to use a cloud service providing predictive analytics to analyze patients' conditions so as to improve the quality of care and reduce costs. Due to ethical and legal requirements, the hospital might be restricted from using such a service [3,4,12]. Like the hospital, many organizations are collecting ever-increasing data for mining to improve decision-making and productivity. However, they may lack the powerful resources needed to deal with such large-scale data. To solve this problem, an attractive business model is that a service provider, which has powerful platforms and advanced analytic skills, provides such services. Organizations that need the computation resources can outsource their computational tasks to such powerful service providers. However, because the data

c Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 89–102, 2018. https://doi.org/10.1007/978-3-030-05057-3_7

90

H. Yang et al.

may contain sensitive information, outsourcing data to public clouds directly raises privacy concerns. In current implementations, the learning algorithm must see all user data in the clear in order to build the predictive model. In this paper, we consider whether the learning algorithm can operate in encrypted domains, thereby allowing users to retain control of their data. For medical data, this allows a model to be built without affecting user privacy. For book and movie preferences, letting users keep control of their data reduces the risk of future unexpected embarrassment in case of a data breach at the service provider. Roughly speaking, there are three existing approaches to ensure privacy when the server mines the user data. The first lets users split their data among multiple servers using secure multi-party computation [2,5,9]. These servers then run the learning algorithm using a distributed protocol; privacy is assured as long as a majority of servers do not collude. The second is based on differential privacy, where the learning algorithm is executed over data containing noise [6,7,13]. The third is based on homomorphic encryption, where the learning algorithm is executed over encrypted data [15]. Distributed linear regression is not suitable for the outsourced model: every party must take part in the computation, so secure multi-party computation may be inefficient. In addition, differential privacy may incur a great loss of accuracy and cannot fully guarantee the security of the data. In this work, we therefore choose homomorphic encryption for our privacy-preserving machine learning algorithm. As we know, homomorphic encryption (HE) allows operations on encrypted data, which provides a possible solution for linear regression over ciphertexts.
In our work, we propose an efficient and secure linear regression protocol over encrypted data for outsourced environments, namely ESLR, where the cloud performs linear regression over encrypted data. The challenge is how to apply the linear regression algorithm over ciphertexts while maintaining high accuracy and performance. To address this challenge, we exploit the vector HE (VHE) recently presented by Zhou and Wornell [17]. Unlike fully HE (FHE), VHE only needs to be somewhat homomorphic. As a result, it is much more efficient than many existing FHE schemes; for example, it is orders of magnitude faster than HElib [10], a well-known FHE implementation. In particular, with careful design, VHE can be used in privacy-preserving gradient descent. Different from existing works, our contributions are twofold. (1) Firstly, ESLR reconstructs the linear regression process in the ciphertext domain by taking advantage of vector encryption, which allows low computation and communication cost. What is more, we propose a scheme that applies the privacy-preserving gradient descent method over the ciphertext domain efficiently. To the best of our knowledge, it is very efficient for optimization over encrypted data. Experiments show that ESLR achieves almost the same accuracy as the plaintext algorithm.


(2) Secondly, security analysis demonstrates that ESLR achieves the confidentiality of data, ensuring the privacy of the data owner. In addition, we give the definition of the loss function that is needed for optimization over the ciphertext domain. This paper is organized as follows: the problem formulation is described in Sect. 2. The construction of the linear regression protocol is proposed in Sect. 3, followed by further discussion in Sect. 4. Then we give the security analysis and performance evaluation in Sects. 5 and 6, respectively. Finally, the conclusion is presented in Sect. 7.

2 Problem Statement

In this section, we give the problem statement, including the system model and threat model, design goals, notations, and preliminaries.

2.1 System Model and Threat Model

We give our system model concentrating on how to achieve secure linear regression over encrypted data in outsourced environments. As shown in Fig. 1, we propose a classical outsourced system model, mainly consisting of two parties: the data owner and the service provider. We primarily consider the service provider as an "honest-but-curious" server in our model. We assume the public matrix H and the encrypted data D′ (D′ is the encryption of the data D) have been outsourced to the cloud, and the confidentiality of the data

Fig. 1. System model (the data owner uploads the public matrix H and the encrypted data D′ to the cloud server, and gives the authorization of S to the data user)


will be protected by the underlying encryption primitive. After that, the server will run the regression algorithm based on D′. That is, the data owner outsources his encrypted data D′, and the service provider runs the proposed protocol over D′. Finally, the service provider returns the predicted results to the data owner.

2.2 Design Goals

The overarching goal is to enable the linear regression algorithm to be performed over encrypted data. Moreover, for an efficient and secure linear regression protocol, we consider the following requirements to be necessary.
– Accuracy: enable secure linear regression over encrypted data in outsourced environments and achieve high accuracy.
– Security: protect the privacy of the linear regression process.
– Efficiency: process large amounts of data with practical performance.

2.3 Overview of Standard Linear Regression and Gradient Descent

In this section, we give a brief introduction to the standard linear regression algorithm [16]. In statistics, a linear regression equation is a regression analysis using the least squares function to find the relationship between one or more independent variables and a dependent variable. This function is a linear combination of one or more model parameters called regression coefficients. Linear regression with only one independent variable is called simple regression; with more than one independent variable it is called multiple regression. Like all forms of regression analysis, linear regression focuses on the probability distribution of x and y. Given a random sample (x_i1, x_i2, ..., x_ip, y_i), we have one hypothetical regression output y_i and hypothetical regression inputs x_i1, x_i2, ..., x_ip. So a multivariate linear regression model is expressed as y_i = w_1 x_i1 + w_2 x_i2 + · · · + w_d x_id + b. For a data set D = [(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)], the goal of linear regression is to find the regression coefficients θ = [w_1, w_2, · · · , w_d, b] such that the loss function attains its minimum value. We define the loss function as

J(θ) = (1/(2n)) Σ_{i=1}^{n} (θᵀ x_i − y_i)².

Further, we formulate the problem as Algorithm 1. The gradient descent method [8] is an iterative method that can be used to solve the least squares problem. Gradient descent is one of the most commonly used methods for solving the model parameters of a machine learning algorithm (an unconstrained optimization problem); the other commonly used method is the least squares method. When minimizing the loss function, gradient descent yields both the minimum value of the loss function and the model parameters.

Eﬃcient and Secure Outsourced Linear Regression

93

Algorithm 1. Standard linear regression
Input: data set D = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)} and threshold t
Output: θ = [w_1, w_2, · · · , w_d, b]
1: Define the loss function J(θ) = (1/(2n)) Σ_{i=1}^{n} (θᵀ x_i − y_i)²
2: Generate θ⁰ randomly
3: repeat
4:   θᵏ = θᵏ⁻¹ − α ∂J(θ)/∂θ, where θᵏ is the value of the kth iteration and α is the iteration step
5: until |J(θᵏ⁺¹) − J(θᵏ)| < t
6: return θ
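For reference, Algorithm 1 can be sketched in plain Python/NumPy as follows. This is an illustrative implementation, not the authors' code: the learning rate and threshold are arbitrary choices, and θ⁰ is set to zero (rather than random) for determinism.

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.05, t=1e-12, max_iter=200000):
    """Gradient descent for linear regression, following Algorithm 1.

    X: (n, d) inputs, y: (n,) targets. A constant column is appended so the
    last entry of theta plays the role of the bias b."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    n = len(Xb)
    theta = np.zeros(Xb.shape[1])                        # θ^0 (zeros, for determinism)
    J = lambda th: ((Xb @ th - y) ** 2).sum() / (2 * n)  # loss J(θ)
    prev = J(theta)
    for _ in range(max_iter):
        theta -= alpha / n * (Xb.T @ (Xb @ theta - y))   # θ^k = θ^{k-1} − α ∂J/∂θ
        cur = J(theta)
        if abs(prev - cur) < t:                          # |J(θ^{k+1}) − J(θ^k)| < t
            break
        prev = cur
    return theta

# Recover y = 2x + 1 from a toy sample
X = np.arange(10, dtype=float).reshape(-1, 1)
theta = linear_regression_gd(X, 2 * X[:, 0] + 1)
```

On this noise-free sample, the returned θ is close to [2, 1], i.e. slope 2 and bias 1.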

2.4 Notations and Preliminaries

In this section, we review the preliminaries that are necessary for our work. First, we give the notations used throughout the paper, as illustrated in Table 1.

Table 1. Notations

⌈a⌋   Round a to the nearest integer, for a ∈ R
⌈a⌋   Round each entry a_i to the nearest integer, for a vector a ∈ R^n
a*    The binary representation of a vector a ∈ Z^n

We outline the VHE scheme as suggested by Zhou and Wornell [17], which encrypts integer vectors so as to allow the computation of arbitrary polynomials in the encrypted domain. For our purpose of ESLR, we only consider the fundamental operations below; more details are given in [17].
– VHE.KG(λ): input a security parameter λ; choose l, m, n, p, q, w ∈ Z and the distribution χ, where l = ⌈log₂(q − 1)⌉, w(p − 1) < q, q ≫ p, and m < n; construct S = [I, T] ∈ Z^{m×n} with I ∈ Z^{m×m} the identity matrix; output the secret key S and the public parameters Param = (l, m, n, p, q, w, χ).
– VHE.E(x, S): input a secret key S ∈ Z^{m×n} and a plaintext vector x ∈ Z^m; output a ciphertext c ∈ Z^n that satisfies Sc = wx + e, where w is a large integer, ‖S‖ ≪ w, and e is an error term with |e| < w/2.
– VHE.D(c, S): input a ciphertext vector c ∈ Z^n and a secret key S ∈ Z^{m×n}; output the plaintext x = ⌈Sc/w⌋ ∈ Z^m.
For the VHE scheme, key switching is an important operation in the encrypted domain. Given two secret keys S ∈ Z^{m×n} and S′ ∈ Z^{m×n′}, and


the ciphertext c ∈ Z^n which decrypts to the plaintext x ∈ Z^m with S, we calculate a matrix M ∈ Z^{n′×nl} producing a new ciphertext c′ ∈ Z^{n′} such that c′ decrypts to the same x with S′. Specifically, this key-switching task can be divided into two steps: M ← VHE.KSM(S, S′) and c′ ← VHE.KS(M, c). Furthermore, as inferred from [17], for a plaintext x, its ciphertext c, and the key-switching matrix M, the following equation holds:

c = M(wx)*.

In addition, it is obvious that VHE supports addition in the ciphertext domain, as S(c_1 + c_2 + · · · + c_n) = w(x_1 + x_2 + · · · + x_n) + e.

2.5 Privacy-Preserving Inner Product

In this section, we present a new technique for computing the inner product of two vectors. For simplicity, assume that there are two vectors x_1 and x_2 which are encrypted to c_1 and c_2 using the vector homomorphic encryption VHE. The challenge is how to calculate their inner product in the ciphertext domain. To tackle the problem, a matrix H is essential. By solving the equation AM = I*, we obtain a matrix A; then we get the matrix H from H = AᵀA. One can prove that c_1ᵀ H c_2 = w² x_1ᵀ x_2. Hence, we can calculate the inner product in the ciphertext domain; we discuss the security of this method later.
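As a sanity check of the identity c_1ᵀ H c_2 = w² x_1ᵀ x_2, here is a small numerical sketch under simplifying assumptions: M is taken square and invertible, there is no encryption noise or secret key, and the helper names (`star`, `I_star`) are illustrative rather than part of the VHE API.

```python
import numpy as np

rng = np.random.default_rng(0)
m, l, w = 3, 8, 2**4      # vector dimension, bits per entry, scaling factor

def star(v):
    # LSB-first bit decomposition of a non-negative integer vector (l bits per entry)
    return np.array([(int(v[i]) >> b) & 1 for i in range(len(v)) for b in range(l)])

# I* reconstructs a vector from its bit decomposition: I* @ star(v) == v
I_star = np.zeros((m, m * l))
for i in range(m):
    I_star[i, i * l:(i + 1) * l] = [2.0 ** b for b in range(l)]

# Toy key-switching matrix M, square and invertible so that A = I* M^{-1} exists
M = rng.integers(-3, 4, size=(m * l, m * l)).astype(float)
while abs(np.linalg.det(M)) < 0.5:
    M = rng.integers(-3, 4, size=(m * l, m * l)).astype(float)
A = I_star @ np.linalg.inv(M)     # solves A M = I*
H = A.T @ A

x1, x2 = np.array([1, 2, 3]), np.array([2, 0, 1])
c1, c2 = M @ star(w * x1), M @ star(w * x2)   # "ciphertexts" c = M (w x)*

assert abs(c1 @ H @ c2 - w**2 * (x1 @ x2)) < 1e-3   # c1^T H c2 == w^2 x1^T x2
```

The assertion holds because c_1ᵀ H c_2 = (A c_1)·(A c_2) = (w x_1)·(w x_2); the real scheme additionally carries a noise term, so the equality is only approximate there.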

3 Proposed Protocol

In this section, we propose the protocol for linear regression over encrypted items in outsourced environments using VHE.

3.1 Reformulating the Problem

In this section, we give a brief introduction to our problem again. We suppose that the data owner owns a database D that can be thought of as a big table of n records x_1, x_2, · · · , x_n. Each record x_i = [x_i1 · · · x_im] includes m attributes. Because the resources of the data owner are limited, the data owner encrypts his database D record-wise and then outsources the encrypted database D′ to the cloud. After that, the service provider applies linear regression over the encrypted data sets and returns the results to the data owner. In this protocol, the service provider learns nothing about the plaintext.

3.2 Linear Regression Over VHE

With the preparatory work above, we first discuss the problem of regression over encrypted data. In order to make our protocol faster and easier, we only consider the security of the data attributes. Suppose the dataset D = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)}, known only to the data owner, is encrypted to D′ = {(c_1, y_1), (c_2, y_2), · · · , (c_n, y_n)}. The relation between plaintext and ciphertext satisfies Sc_i = wx_i + e_i, where i = 1, 2, · · · , n. When the service provider gets the encrypted data set D′ from the data owner, it applies the linear regression protocol over D′. The whole process is divided into three phases: Preparation, Regression, and BackResults.
– Preparation(D, λ). The security parameter λ and the data set D are taken as input, and the data owner generates a secret key S and a key-switching matrix M which, for every record, satisfies the following equation:

c = M(wx)*,

where c is the ciphertext of x. The data owner needs to calculate the key-switching matrix M only once and can then use M to encrypt all the records x. As we know, the most costly operation of VHE is key switching; by using the same key-switching matrix M to encrypt the data, we save a large part of the encryption overhead. Then, the data owner needs to calculate the matrix H, which is used to define the loss function over encrypted data. As we know, the following equation holds:

wx = I*(wx)*.

The data owner solves the matrix equation AM = I* to obtain the matrix A. Finally, the data owner gets the matrix H as H = AᵀA.

Finally, the data owner uploads the encrypted data set D′ and the matrix H to the service provider.

– Regression(D′, H). The service provider gets the encrypted data set D′ = {(c_1, y_1), (c_2, y_2), · · · , (c_n, y_n)} and the matrix H from the data owner, and applies the regression algorithm, which includes the steps below:
(1) Generate a vector θ′ randomly and choose a threshold t.
(2) Define the loss function over encrypted data as

J′(θ′) = (1/(2n)) Σ_{i=1}^{n} ((1/w²) θ′ᵀ H c_i − y_i)².


(3) Update θ′ based on the gradient descent method as below:

θ′ᵏ = θ′ᵏ⁻¹ − α ∂J′(θ′)/∂θ′,

where θ′ᵏ is the value of the kth iteration.
(4) Repeat step (3) until the value of the loss function satisfies the condition below:

|J′(θ′ᵏ) − J′(θ′ᵏ⁻¹)| < t.

– BackResults(θ′). From Regression, the cloud obtains the encrypted parameters θ′ and returns them to the data owner.
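To make the Preparation and Regression phases concrete, the following is a self-contained toy sketch, not the paper's implementation: it builds a noiseless stand-in for VHE ciphertexts c_i = M(wx_i)*, derives H = AᵀA with AM = I*, and runs the gradient descent of the Regression phase entirely on ciphertexts. All names (`star`, `I_star`, the dataset, the learning rate) are illustrative assumptions; a real deployment would use the noisy, key-switched VHE of [17].

```python
import numpy as np

rng = np.random.default_rng(1)
m, l, w = 2, 10, 2**4     # record dimension, bits per entry, scaling factor

def star(v):
    # LSB-first bit decomposition of a non-negative integer vector
    return np.array([(int(v[i]) >> b) & 1 for i in range(len(v)) for b in range(l)])

I_star = np.zeros((m, m * l))
for i in range(m):
    I_star[i, i * l:(i + 1) * l] = [2.0 ** b for b in range(l)]

M = rng.integers(-3, 4, size=(m * l, m * l)).astype(float)   # toy key-switch matrix
while abs(np.linalg.det(M)) < 0.5:
    M = rng.integers(-3, 4, size=(m * l, m * l)).astype(float)
A = I_star @ np.linalg.inv(M)   # A M = I*
H = A.T @ A                     # public matrix uploaded in the Preparation phase

# Toy dataset with y = 3*x1 + 2*x2; records are small non-negative integers
X = np.array([[1, 0], [0, 1], [1, 1], [2, 1], [1, 2]])
y = X @ np.array([3.0, 2.0])
C = np.vstack([M @ star(w * x) for x in X])    # "encrypted" records c_i = M (w x_i)*

# Regression phase: gradient descent entirely on ciphertexts.
# theta' starts at 0 and stays a linear combination of the c_i (cf. Sect. 4.2).
alpha, n = 0.05, len(X)
theta_c = np.zeros(m * l)
for _ in range(20000):
    preds = C @ (H @ theta_c) / w**2           # (1/w^2) θ'ᵀ H c_i  ==  θᵀ x_i
    theta_c -= alpha / n * (C.T @ (preds - y))

theta = A @ theta_c / w   # "decrypt" the parameters in this noiseless toy
```

Because A c_i = w x_i, the ciphertext-space iteration mirrors the plaintext gradient descent step for step, so the recovered θ approaches the true coefficients [3, 2].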

4 Discussion

We have shown how to achieve a basic protocol for linear regression over encrypted data in outsourced environments. In this section, we give the correctness analysis of our protocol and a brief introduction to how the encrypted results are used.

4.1 Loss Function over Encrypted Data

In this section, we show the correctness of the loss function over encrypted data, verifying that the following equations hold:

J′(θ′) = (1/(2n)) Σ_{i=1}^{n} ((1/w²) θ′ᵀ H c_i − y_i)²
       = (1/(2n)) Σ_{i=1}^{n} ((1/w²) w² θᵀ x_i − y_i)²
       = (1/(2n)) Σ_{i=1}^{n} (θᵀ x_i − y_i)²
       = J(θ).

As we can see, the loss function on the encrypted data is equal to the loss function on the plaintext.

4.2 Encrypted Parameters

In this section, we discuss the relationship between the encrypted parameters θ′ and the encrypted data. First of all, we analyze the loss function over plaintexts, which is given as follows:

J(θ) = (1/(2n)) Σ_{i=1}^{n} (θᵀ x_i − y_i)²


Gradient descent is one of the most commonly used methods for solving the model parameters. By minimizing the loss function, we obtain both its minimum value and the model parameters θ = [θ_1, θ_2, . . . , θ_d], where the iterative equation is given below:

θ := θ − α ∂J(θ)/∂θ.

Component-wise, θ_j := θ_j − (α/n) Σ_{i=1}^{n} (θᵀ x_i − y_i) x_ij for j = 1, . . . , d, which in vector form reads

θ := θ − (α/n) Σ_{i=1}^{n} (θᵀ x_i − y_i) x_i,

where α is the iteration step. Note that θ is a linear combination of the x_i when the initial value is set to the zero vector. Linear combinations are supported by vector homomorphic encryption, and thus we can obtain the result in the encrypted domain.
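The claim that θ remains a linear combination of the x_i (when initialized to 0) can be checked numerically by tracking the combination coefficients alongside the usual update; the names and toy data below are illustrative:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])   # records x_i
y = X @ np.array([3.0, 2.0])                          # targets from theta* = [3, 2]
n, d = X.shape
alpha = 0.1

theta = np.zeros(d)   # theta^0 = 0, so theta stays in span{x_i}
beta = np.zeros(n)    # coefficients of theta in that span
for _ in range(1000):
    residual = X @ theta - y
    theta -= alpha / n * (X.T @ residual)   # θ := θ − (α/n) Σ (θᵀx_i − y_i) x_i
    beta -= alpha / n * residual            # the same update, tracked per record

assert np.allclose(theta, X.T @ beta)       # θ is a linear combination of the x_i
```

By induction, each update subtracts a multiple of every x_i, so θᵏ = Σ_i β_i x_i at every step; this is exactly what lets the cloud maintain θ′ as a combination of the ciphertexts c_i.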

5 Security Analysis

In this section, we give the security analysis for ESLR, focusing on the encrypted database D′ = {c_1, c_2, · · · , c_n} and the matrix H. The honest-but-curious cloud server cannot threaten the privacy of the data owner, i.e., the cloud cannot recover the plaintext database D = {x_1, x_2, · · · , x_n}. First of all, c_i is the ciphertext of x_i under the VHE encryption, for i = 1, 2, · · · , n. For convenience, we omit the subscripts, writing c = VHE.E(x, S), where S is the secret key. Therefore, the confidentiality of x is ensured as long as the encryption scheme VHE is secure and the secret key S is not known by the cloud. Of course, we may suppose that the secret key S is stored privately by the data owner, so the cloud cannot get it. Hence, we focus on the security of VHE. As shown in [17], the security of VHE reduces to the learning with errors (LWE) problem. It is well known that the LWE problem is as hard to solve as several worst-case lattice problems [14]. As a result, the intractability of LWE assures the security of VHE. However, in order to evaluate the distance of two ciphertext vectors, we introduce a special matrix H. It is natural to ask whether H may bring a certain

98

H. Yang et al.

unknown privacy risk. For example, on one hand, to calculate H, we first solve the equation I* = AM to obtain A, then compute H = AᵀA. On the other hand, according to VHE, for the ciphertext c and the plaintext x, c = M(wx)* holds. As known, the cloud has H and c. If the cloud combines the following equations, it seems that the cloud could recover the plaintext x:

H = AᵀA
I* = AM
c = M(wx)*

In the following, we give a positive answer to this challenge: the analysis demonstrates that the cloud still cannot recover the plaintext x from the ciphertext c by exploiting H. As is known, for a random orthogonal matrix Q, satisfying QᵀQ = I where I is an identity matrix, we have

H = AᵀA = Aᵀ(QᵀQ)A = (QA)ᵀ(QA).

It is clear that the equation H = AᵀA has infinitely many solutions for A, since Q is randomly chosen and QA is also a solution. Therefore, the cloud cannot extract the matrix A from the norm-matrix H. Furthermore, without knowing A, the cloud cannot get M, and it cannot recover the plaintext x from the ciphertext c. As a result, we achieve the privacy of the database D.
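The non-uniqueness argument is easy to check numerically: for any orthogonal Q, the matrix QA produces exactly the same H, so H does not determine A. A small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))
H = A.T @ A

# Any orthogonal Q gives another matrix QA with exactly the same Gram matrix H,
# so the cloud cannot recover A (and hence M) from H alone.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal matrix
A2 = Q @ A

assert np.allclose(A2.T @ A2, H)   # same H ...
assert not np.allclose(A2, A)      # ... from a different A
```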

6 Performance Evaluation

In this section, we evaluate the proposed linear regression protocol. Our data sets come from the UCI repository [1], and the experiment environment includes a data owner and a service provider. The Python language is used on a Windows 10 machine with an i3-4130 CPU @1.40 GHz and 4 GB RAM for the user, and the server is a Linux machine with an Intel Xeon E5-2430 v2 CPU @2.5 GHz and 16 GB RAM running Ubuntu 14.04 LTS. The user acts as the data owner and data user, and the server acts as the service provider. In the following, we conduct simulation experiments in terms of time cost, accuracy, and communication overhead.

6.1 Time Cost and Accuracy

Firstly, we evaluate the time cost by comparing the running time between plaintext and ciphertext. As illustrated in Fig. 2, we choose 4 data sets from the UCI repository to validate our protocol, and we can see that the linear regression

Fig. 2. Comparison of running time between plaintext and ciphertext (panels (a)–(d): datasets 1–4)

Fig. 3. Comparison between real results and predicted results in the encrypted domain (panels (a)–(d): datasets 1–4)


on ciphertext is a little slower than that on plaintext. However, the result is acceptable, and it yields almost the same results for the data sets between the plaintext and the ciphertext. Then, we show the comparison of accuracy between the real results and the predicted results of the four different data sets in the ciphertext domain. As illustrated in Fig. 3, we can see that the predicted results almost coincide with the actual results in the ciphertext domain. Furthermore, we choose the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R Squared (R-S) as the indexes of linear regression to evaluate our model. As seen in Table 2, compared to results in the plaintext domain, our protocol achieves almost the same prediction performance. This shows that our model performs well in the ciphertext domain.

Table 2. Prediction performance over plaintexts and ciphertexts

Data                     MSE     RMSE   MAE    R-S
Plaintext on dataset 1   9.964   3.156  2.427  0.838
Ciphertext on dataset 1  15.253  3.905  2.873  0.768
Plaintext on dataset 2   0.007   0.081  0.068  0.972
Ciphertext on dataset 2  0.025   0.157  0.133  0.895
Plaintext on dataset 3   23.999  4.899  3.790  0.497
Ciphertext on dataset 3  24.987  4.999  3.897  0.475
Plaintext on dataset 4   19.134  4.374  3.499  0.782
Ciphertext on dataset 4  20.134  4.564  3.619  0.768

6.2 Communication Cost

In this section, we discuss the communication cost of our protocol. The communication cost mainly comes from the ciphertexts and the matrix H, which is used to define the loss function. Firstly, for n records where every record has m dimensions, encrypting the data items produces O(m(n + 1)) communication traffic overhead. Secondly, the matrix H generates O((n + 1)²) communication traffic overhead. That means the protocol produces O((m + n + 1)(n + 1)) communication traffic overhead in total in the encrypted domain. On the other hand, the complexity of the plaintext stage is O(mn) for the same data sets. In fact, m is often far greater than n because of the curse of dimensionality [11]. So the communication traffic overhead between plaintext and ciphertext is almost the same when m is far greater than n and m is big enough.

7 Conclusion

In this paper we have proposed an efficient and secure linear regression protocol over encrypted data using vector homomorphic encryption. In particular, we have given a good solution to the challenging problem of a privacy-preserving gradient descent method. Performance evaluation shows that it has high accuracy and low computation and communication cost. As we know, many machine learning algorithms are based on the gradient descent method; in the future, we will apply this method to other machine learning algorithms.

Acknowledgement. Our work is supported by the National Key Research and Development Program of China (2017YFB0802003), the National Natural Science Foundation of China (U1633114) and the Sichuan Science and Technology Program (2018GZ0202).

References

1. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
2. Ben-David, A., Nisan, N., Pinkas, B.: FairplayMP: a system for secure multi-party computation. In: Proceedings of the 15th ACM Conference on Computer and Communications Security, pp. 257–266. ACM (2008)
3. Dankar, F.K., El Emam, K.: The application of differential privacy to health data. In: Proceedings of the 2012 Joint EDBT/ICDT Workshops, pp. 158–166. ACM (2012)
4. Centers for Disease Control and Prevention, et al.: HIPAA privacy rule and public health. Guidance from CDC and the US Department of Health and Human Services. MMWR Morb. Mortal. Wkly. Rep. 52(Suppl. 1), 1–17 (2003)
5. Du, W., Atallah, M.J.: Secure multi-party computation problems and their applications: a review and open problems. In: Proceedings of the 2001 Workshop on New Security Paradigms, pp. 13–22. ACM (2001)
6. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1
7. Dwork, C., Roth, A., et al.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
8. Fletcher, R., Powell, M.J.: A rapidly convergent descent method for minimization. Comput. J. 6(2), 163–168 (1963)
9. Goldreich, O.: Secure multi-party computation. Manuscript. Preliminary version, pp. 86–97 (1998)
10. Halevi, S., Shoup, V.: HElib (2014). https://github.com/shaih/HElib
11. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
12. Lee, L.M., Gostin, L.O.: Ethical collection, storage, and use of public health data: a proposal for a national privacy protection. JAMA 302(1), 82–84 (2009)
13. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2007, pp. 94–103. IEEE (2007)


14. Regev, O.: On lattices, learning with errors, random linear codes, and cryptography. J. ACM 56(6), 1–40 (2009)
15. van Dijk, M., Gentry, C., Halevi, S., Vaikuntanathan, V.: Fully homomorphic encryption over the integers. In: Gilbert, H. (ed.) EUROCRYPT 2010. LNCS, vol. 6110, pp. 24–43. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13190-5_2
16. Wold, S., Ruhe, A., Wold, H., Dunn III, W.: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5(3), 735–743 (1984)
17. Zhou, H., Wornell, G.: Efficient homomorphic encryption on integer vectors and its applications. In: Information Theory and Applications Workshop (ITA), 2014, pp. 1–9. IEEE (2014)

New Multi-objectives Scheduling Strategies in Docker SwarmKit

Tarek Menouer, Christophe Cérin(B), and Étienne Leclercq
{tarek.menouer,christophe.cerin,etienne.leclercq}@lipn.univ-paris13.fr
University of Paris 13, Sorbonne Paris Cité, LIPN/CNRS UMR 7030, 93430 Villetaneuse, France

Abstract. This paper presents new multi-objectives scheduling strategies implemented in Docker SwarmKit. Docker SwarmKit is a container toolkit for orchestrating distributed systems at any scale. Currently, Docker SwarmKit has one scheduling strategy, called Spread. Spread is based on only one objective to select, from a set of cloud nodes, one node to execute a container. However, the containers submitted by users to be scheduled in Docker SwarmKit are configured according to multi-objectives criteria, such as the number of CPUs and the memory size. To better address the multi-objectives configuration problem of containers, we introduce the concept and the implementation of new multi-objectives scheduling strategies adapted to cloud computing environments and implemented in Docker SwarmKit. The principle of our multi-objectives strategies consists in selecting a node which offers a good compromise between multi-objectives criteria to execute a container. The proposed scheduling strategies are based on a combination of the PROMETHEE and Kung multi-objectives decision algorithms in order to place containers. The implementation in Docker SwarmKit and experiments with our new strategies demonstrate the potential of our approach under different scenarios.

Keywords: Systems software · Scheduling and resource management · Container technology · Cloud computing · Application of parallel and distributed algorithms

1 Introduction

Nowadays, cloud computing is the commercial subscription to external services. Its principle is based on a pay-per-use model that can affect different elements such as the requested application, data storage capacity, memory processing, and number of users. Different forms of cloud computational resources exist, such as virtual machines (VMs), containers, or bare-metal resources, each having its own characteristics. Container technology is relatively new in production systems, but it is not a new concept, and it has increasingly grown in cloud environments.

c Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 103–117, 2018. https://doi.org/10.1007/978-3-030-05057-3_8

104

T. Menouer et al.

Docker SwarmKit [25] is a toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, Raft-based consensus, container scheduling, and more. In the container context, it selects the first Docker container to be executed using the classical FIFO (First In First Out) strategy. Then, it chooses the appropriate cloud node from a set of nodes using the Spread scheduling strategy. The principle of Spread is to execute a container on the node having the least number of containers; Spread is thus a mono-objective scheduling strategy. However, the containers scheduled by Docker SwarmKit are configured regarding multi-objectives criteria, like the number of CPUs used and the size of the memory used. To take the multi-objectives configuration into consideration, we present in this paper the idea of new multi-objectives scheduling strategies implemented in Docker SwarmKit. The goal is to address the problem of companies that manage a private infrastructure of nodes, i.e. a cloud platform, and would like to optimize the scheduling of several containers submitted online by users. In this paper, for the sake of simplicity, each container is scheduled by taking into consideration two criteria: (i) the number of CPUs and (ii) the memory size. Indeed, the overall motivation for such multi-objectives scheduling strategies comes from the industrial Fonds Unique Interministériel (FUI-22) Wolphin1 project, a collaborative industrial project oriented towards the themes of orchestration and optimization of the execution of containers. Ultimately, the project, supervised by Alterway2, aims to provide an efficient solution for hypervision and invoicing of container-oriented infrastructure. In fact, in the Wolphin project the Alterway company would like to improve the Docker SwarmKit scheduler to optimize the scheduling of containers submitted online by users.
Alterway would like to reduce the cost of the infrastructure by choosing the most appropriate nodes so as to maximize the number of containers executed during a time period. Each container should be executed on a node offering a good compromise between its number of available CPU cores and its free memory. This paper demonstrates that there is room for improvement in the Docker SwarmKit toolkit. We propose to specialize two multi-objective decision algorithms, called PROMETHEE and Kung, which are used to select, for each submitted container, the node that must execute it according to a "good" compromise between multiple criteria. This is a first step toward high-dimensional decision support for scheduling containers inside the concrete Docker SwarmKit toolkit. The essence of cloud computing is precisely to be able to deal with the challenging problem of multiple objectives in heterogeneous and dynamic environments, for the benefit of the user and/or the platform. The paper is organized as follows. Section 2 presents some related work. Section 3 describes our multi-objective scheduling strategies based on the PROMETHEE and Kung algorithms. Section 4 shows a comparative example between the proposed multi-objective scheduling strategies and the Spread

1 https://www.alterway.fr/wolphin-2-0-laureat-du-fui-22/.
2 https://www.alterway.fr.

Multi-objectives Scheduling Strategies

105

strategy, which is the default SwarmKit scheduling strategy. Section 5 presents extensive experiments that validate our strategies. Finally, a conclusion and some future work are given in Sect. 6.

2 Related Work

In the literature, many problems of resource allocation or placement of users' containers or requests belong to the same class of scheduling problems. They generally consist in associating a user's container with one or several computing cores on a particular node. Most of these problems are NP-hard [20]. In this general context, the next subsection presents several proposed scheduling systems and computing frameworks. Subsection 2.2 surveys some multi-objective studies from the literature, Subsect. 2.3 briefly discusses machine learning techniques for large-scale multi-objective optimization, and Subsect. 2.4 concludes the section with a positioning of our work.

2.1 Containers Scheduling and Cloud Computing

In the literature, some frameworks have been proposed to schedule containers in cloud computing [5,17,23,24]. To position our work from an industrial point of view, we document, as examples of concrete projects, the schedulers inside Google Kubernetes [24], Docker SwarmKit [25], and Apache Mesos [23]. Google Kubernetes [24] is an orchestration system for Docker containers based on the concept of pods. Pods are groups of one or more containers, such as Docker containers, that are always co-located, co-scheduled, and run in a shared context; they run on the same physical or virtual machine (node). The Kubernetes scheduling principle can be summarized in two steps. The first step filters all nodes to remove those that do not meet certain requirements of the pod. The second step ranks the remaining nodes using priorities to find the best fit to execute the pod. A priority is a key/value pair giving the name of a priority from the list of existing ones and its weight. For each remaining node, each priority function yields a score on a scale from 0 to 10. Each priority function is weighted by a positive number, and the final score of a node is the sum of its weighted scores. Once all node scores are computed, Kubernetes chooses the node with the highest score to run the container. Docker SwarmKit [25] is an important container scheduling framework developed by Docker. It takes two steps to choose which node will execute a container. First, it uses filters to select the nodes suitable for executing the container according to the number of available CPU cores and the free memory. Then, following the Spread scheduling strategy, it chooses the most suitable node to execute the selected container. The principle of the Spread strategy is to execute a container on the node hosting the least number of containers. The goal of


Spread is to provide a "good" load balancing of containers across all nodes of the infrastructure. The Mesos system [23] delegates control over scheduling to the frameworks, because many frameworks already implement sophisticated scheduling [9]. Apache Mesos [23] has native Docker support, which offers many scheduling features such as constraints, service discovery, and load balancing [9]. It relies on four elements to schedule containers on the cluster: ZooKeeper helps Marathon find the address of the Mesos master; Marathon starts, monitors, and scales the containers; the Mesos master sends the tasks assigned to a node and informs Marathon when a node has free resources; and the Mesos slaves are the set of nodes used to execute containers. There also exist studies related to resource management, such as those presented in [5,11,15]. Choi et al. [5] propose a framework which provides useful resource management functions and, more importantly, makes it possible to apply customized scheduling in a local environment. Using this framework, cloud providers or researchers can optimize resources for their purposes. Jimenez et al. [11] introduce a resource monitoring agent for the resource management of container environments. The advantage of their approach is that it allows the monitor to assign the resources of each container through the proposed agent. Medel et al. [15] innovate with a client-side scheduling approach in Kubernetes that aims to reduce the resource contention phenomenon in container technologies. Their approach characterizes applications in terms of resource usage and extends the Kubernetes scheduler so that it can take better container allocation decisions based on this characterization. The characterization divides applications into two categories, namely high and low resource usage.
The classification process of applications is delegated to the client or developer, who provides the category that best fits the application.

2.2 Short Overview of Multi-objectives Related Problems

Combinatorial and discrete optimization problems such as routing, task allocation, and scheduling are important optimization applications in the real world. Traditionally, the time required to solve a combinatorial problem may increase exponentially in the worst case, making such problems computationally too costly. Moreover, if the optimization involves multiple objectives, the process becomes even more complex and difficult to solve [22]. Xing et al. [21] present a simulation model to solve a multi-objective Flexible Job-Shop Scheduling Problem (FJSSP). The FJSSP is very important in the fields of combinatorial optimization and production management. Through their experiments, the authors showed that multi-objective evolutionary algorithms are very effective for solving the FJSSP.


Knowles et al. [12] propose a Pareto archived evolution strategy to solve multi-objective optimization problems. The algorithm introduces a Pareto ranking-based selection method and couples it with a partition scheme in objective space. It uses two different archives to save non-dominated solutions. Chang et al. [4] proposed a new algorithm, called the sub-population genetic algorithm II, to solve multi-objective combinatorial problems. The algorithm develops a mechanism to exchange information among sub-populations: once a sub-population reaches a better non-dominated solution, the other sub-populations apply it directly in their search space. In this way, all individuals in the same population are guided to search toward the true Pareto front.

2.3 Multi-dimensional Search

Machine learning algorithms for large-scale multi-objective optimization may also be considered as techniques to accelerate the search for solutions in multidimensional spaces. We assume that solving large-scale multi-objective scheduling problems on large-scale systems remains challenging. Such general techniques from the field of machine learning include surrogate meta-models, multi-armed bandits [14], landscape analysis [6], and online/offline automatic algorithm selection and configuration [2]. Our work may be considered as a practical investigation of the limits of known multi-objective optimization techniques for solving concrete problems inside the popular Docker SwarmKit. Once these limits are isolated and understood, we can better choose, in the future, other appropriate techniques for multidimensional spaces.

2.4 Positioning

To the best of our knowledge, all of the studies proposed previously in the context of cloud scheduling use a mono-objective strategy to select the node which executes a container. The novelty of this paper is to improve the Docker SwarmKit scheduling system with new multi-objective strategies that select, for each submitted container, the node that will execute it. Indeed, this paper is an extension of a preliminary paper presented in [3], whose context was a naive mono-objective scheduling strategy implemented in Docker Swarm.

3 Multi-objectives Scheduling Strategies

In the following, we first present the PROMETHEE scheduling strategy, and then the Kung scheduling strategy. After introducing each multi-objective scheduling strategy, we give an illustrated example that explains how it operates.

3.1 PROMETHEE Scheduling Strategy

The first proposed scheduling strategy is based on the PROMETHEE II (Preference Ranking Organization METHod for Enrichment Evaluations) algorithm [18]. PROMETHEE II is a multi-objective decision algorithm that builds an outranking between different alternatives [18]. It is used here because it provides a node to execute a container with a "good" compromise between: (i) the number of waiting CPUs and (ii) the unused memory space. Indeed, PROMETHEE II has been used successfully to solve many problems [1]. In our case, it is based on a pairwise comparison of the possible decisions (nodes) along the number-of-waiting-CPUs and free-memory-size criteria. Each criterion can be evaluated according to one of two functions (minimization or maximization). The PROMETHEE II algorithm requires two pieces of information for each criterion: a weight and a preference function. In our context, the weight of every criterion is the same and equal to 1. The preference function translates the difference between the evaluations of two possible nodes on a criterion into a preference degree ranging from 0 to 1. In [10], six basic preference functions have been proposed. In this work, for the sake of simplicity, we use the usual preference function. To summarize, the PROMETHEE II algorithm is composed of four steps [19] and is used as follows:

1. Compute, for each pair of possible nodes (nodea and nodeb) and for each criterion (number of waiting CPUs or free memory size), the value of the preference degree. Let gj(nodea) be the value of criterion j for a node nodea. We note dj(nodea, nodeb) = gj(nodea) − gj(nodeb) the difference in the value of criterion j between nodea and nodeb, and Pj(nodea, nodeb) the preference degree of criterion j for nodea and nodeb. The preference function used in this paper to compute these preference degrees is defined as:

   Pj(dj) = 0 if dj ≤ 0,
   Pj(dj) = 1 if dj > 0.

2. Compute, for each pair of possible nodes, a global preference index. Let C be the set of considered criteria (number of waiting CPUs and free memory size) and wj the weight associated with criterion j. The global preference index for a pair of possible nodes nodea and nodeb is computed as follows:

   π(nodea, nodeb) = Σ_{j∈C} wj × Pj(nodea, nodeb)

3. Compute, for each possible node, the positive outranking flow φ+(nodea) and the negative outranking flow φ−(nodea). Let A be the set of nodes, of size n. The positive and negative outranking flows of the nodes are computed by the following formulas:

   φ+(nodea) = (1 / (n − 1)) Σ_{x∈A} π(nodea, x)

   and

   φ−(nodea) = (1 / (n − 1)) Σ_{x∈A} π(x, nodea)

4. Compute the net outranking flows to establish a complete ranking between nodes. The ranking is based on the net outranking flow φ(nodea), computed as φ(nodea) = φ+(nodea) − φ−(nodea).

In our work, when all criteria are minimized, the first node returned by PROMETHEE II is the node with the minimum net outranking flow.

Example of How the PROMETHEE Scheduling Strategy Works: Assume that at time t0 we have a container Cx which needs 8 CPUs and 8 GB of memory. We also assume that, among all nodes of the infrastructure, only four nodes (nodea, nodeb, nodec, and noded) can execute Cx. The availability of each node in terms of number of waiting CPUs and free memory size is presented in Table 1.

Table 1. Node configurations in terms of waiting CPUs and free memory size

Nodes   Number of waiting CPUs   Memory size
na      10                       10
nb      20                       40
nc      30                       40
nd      40                       50

As explained before, to select the first node that must execute the container Cx using the PROMETHEE scheduling strategy (with a minimization function on all criteria), we start by computing, for each pair of nodes and for each criterion, the difference value dj(nodei, nodej) and the preference degree Pj(nodei, nodej). The system then calculates the global preference index π(nodei, nodej). For example, for the first pair of nodes (nodea, nodeb) in Table 2, the difference value on the waiting-CPUs criterion is d(nodea, nodeb) = 10 − 20 = −10. Since this difference is negative, the usual preference function gives a preference degree equal to 0. As the weight of every criterion is the same and equal to 1, the global preference index of the pair (nodea, nodeb) is 1 × 0 + 1 × 0 = 0. Finally, to rank the nodes and select the node which will execute the container, our strategy calculates the positive and negative outranking flows and the net outranking flow; Table 3 shows these parameters. For example, for nodeb, the positive outranking flow is φ+ = 1, the negative outranking flow is φ− = 1.5, and the net outranking flow is φ = φ+ − φ− = 1 − 1.5 = −0.5. Using the PROMETHEE strategy, nodea is the first node selected, since it has the minimum net outranking flow.
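To make the four steps concrete, here is a minimal Python sketch (an illustration only, not the Go implementation inside SwarmKit) that replays the Table 1 data with both criteria minimized and equal weights. Note that the flows use the 1/(n − 1) normalization from the formulas above, so the absolute values differ slightly from the worked tables, but the resulting ranking, and hence the selected node, is the same.

```python
# Hedged sketch of PROMETHEE II node selection (not the SwarmKit code).
# Each node maps to its (waiting CPUs, free memory) values; both criteria
# are minimized with weight 1, as in the paper's worked example.

def preference(d):
    """Usual preference function: P(d) = 1 if d > 0, else 0."""
    return 1 if d > 0 else 0

def promethee_select(nodes, weights=(1, 1)):
    names = list(nodes)
    n = len(names)
    # Steps 1-2: pairwise global preference indices pi(a, b),
    # with d_j(a, b) = g_j(a) - g_j(b).
    pi = {(a, b): sum(w * preference(ga - gb)
                      for w, ga, gb in zip(weights, nodes[a], nodes[b]))
          for a in names for b in names if a != b}
    # Step 3: positive and negative outranking flows.
    phi_plus = {a: sum(pi[a, x] for x in names if x != a) / (n - 1) for a in names}
    phi_minus = {a: sum(pi[x, a] for x in names if x != a) / (n - 1) for a in names}
    # Step 4: net outranking flow; under minimization the best node is the
    # one with the minimum net flow (nodea in the worked example).
    phi = {a: phi_plus[a] - phi_minus[a] for a in names}
    return min(names, key=phi.get), phi

nodes = {"na": (10, 10), "nb": (20, 40), "nc": (30, 40), "nd": (40, 50)}
best, phi = promethee_select(nodes)
print(best)  # na
```

Running the sketch on the Table 1 data yields the same ordering na, nb, nc, nd as the tables, with na selected.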


Table 2. Computing the difference values, preference degrees, and preference index value for each pair of nodes

Pair of nodes   Difference values         Preference degrees        Weights                   Preference
                (waiting CPUs, memory)    (waiting CPUs, memory)    (waiting CPUs, memory)    index value
d(na, nb)       −10, −30                  0, 0                      1, 1                      0
d(na, nc)       −20, −30                  0, 0                      1, 1                      0
d(na, nd)       −30, −40                  0, 0                      1, 1                      0
d(nb, na)        10,  30                  1, 1                      1, 1                      2
d(nb, nc)       −10,   0                  0, 0                      1, 1                      0
d(nb, nd)       −20, −10                  0, 0                      1, 1                      0
d(nc, na)        20,  30                  1, 1                      1, 1                      2
d(nc, nb)        10,   0                  1, 0                      1, 1                      1
d(nc, nd)       −10, −10                  0, 0                      1, 1                      0
d(nd, na)        30,  40                  1, 1                      1, 1                      2
d(nd, nb)        20,  10                  1, 1                      1, 1                      2
d(nd, nc)        10,  10                  1, 1                      1, 1                      2

Table 3. Computing the net outranking flow for each node

Nodes   φ+     φ−     φ      Rank
na      0      3      −3     1
nb      1      1.5    −0.5   2
nc      1.5    1      0.5    3
nd      3      0      3      4

3.2 Kung Scheduling Strategy

The second multi-objective scheduling strategy is based on the Kung algorithm [13], one of the best algorithms for computing non-dominated sets in a multi-criteria context [7]. As presented in [7], the Kung algorithm first sorts the population (the nodes that can execute a container) by decreasing preference on the first criterion (number of waiting CPUs). The set of nodes is then recursively halved into a Top half (T) and a Bottom half (B). Since T is at least as good as B on the first objective, each solution of B is checked for domination against T: the solutions of B that are not dominated by any solution of T are merged with the members of T to form the merged set of nodes M. In our context, we use a minimization function, which means that a solution x1 is better than another solution x2 on a criterion if the value of x1 is smaller than the value of x2. The complete algorithm can be summarized in two steps:

– Sort the nodes by decreasing preference on the number-of-waiting-CPUs criterion (i.e., best value first, which under minimization is the ascending order of values) and rename the population P, of size N.


– Front(P): if |P| = 1, return P as the output of Front(P). Otherwise, T = Front(P^1, …, P^{|P|/2}) and B = Front(P^{|P|/2+1}, …, P^{|P|}). If the i-th non-dominated solution of B is not dominated by any non-dominated solution of T, add it to the merged set M = T ∪ {i}. Finally, return M as the output of Front(P).

We say that a solution x1 dominates another solution x2 if two conditions are satisfied:

1. Solution x1 is no worse than x2 on all criteria;
2. Solution x1 is strictly better than x2 on at least one criterion.

A solution x1 dominates a solution x2 if and only if x2 is dominated by x1. In our context, the goal of the Kung algorithm is to select a set of nodes with a "good" compromise between the availability of CPU cores and the free memory. Then, our strategy returns, from the set of nodes computed by the Kung algorithm, the first node that can execute the container.

Fig. 1. Example of Kung strategy (Color ﬁgure online)

Example of How the Kung Scheduling Strategy Works: Assume that at time t0 we have a container Cx which needs 8 CPUs and 8 GB of memory. We also assume that, among all nodes of the infrastructure, only four nodes (nodea, nodeb, nodec, and noded) can execute Cx. The availability of each node in terms of number of waiting CPUs and memory size is presented in Table 1 (the same table as in Sect. 3.1). As explained before, to select the first node that must execute the container Cx with the Kung strategy, we start by ordering all nodes according to the value of the waiting-CPUs criterion. Then the set of nodes is recursively halved into Top (T) and Bottom (B) subsets, as shown in red in Fig. 1. After applying the second step of the Kung algorithm, as shown in blue in Fig. 1, the selected node is nodea. We note that the Kung and PROMETHEE scheduling strategies give the same result and choose nodea to execute the container Cx.
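The recursive front computation can be sketched in Python (an illustration, not the SwarmKit code; node values are from Table 1, both criteria are minimized, and the initial sort is best-value-first on the CPU criterion, which under minimization is the ascending order of values):

```python
# Sketch of Kung's non-dominated front algorithm for the minimization case.

def dominates(x, y):
    """x dominates y: no worse on every criterion, strictly better on one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

def front(pop):
    """Kung's recursive front: halve, recurse, then merge T with the
    solutions of B not dominated by any solution of T."""
    if len(pop) == 1:
        return pop
    half = len(pop) // 2
    top = front(pop[:half])      # T: better half on the first criterion
    bottom = front(pop[half:])   # B: checked for domination against T
    return top + [b for b in bottom
                  if not any(dominates(t[1], b[1]) for t in top)]

nodes = [("na", (10, 10)), ("nb", (20, 40)), ("nc", (30, 40)), ("nd", (40, 50))]
# Best-first sort on the waiting-CPUs criterion (ascending, since we minimize).
pop = sorted(nodes, key=lambda item: item[1][0])
print(front(pop))  # [('na', (10, 10))]
```

On the Table 1 data the front reduces to na alone, which dominates the three other nodes, matching the node chosen in the example.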

4 Comparative Example Between Scheduling Strategies

Figures 2 and 3 compare the scheduling of 3 containers with the Spread strategy (Fig. 2) and with a multi-objective strategy (PROMETHEE or Kung) (Fig. 3).

Fig. 2. Scheduling with Spread strategy

Fig. 3. Scheduling with multi-objectives strategy

The principle of the Spread strategy is to execute a container on the node hosting the least number of containers. For example, with n nodes, Spread selects nodes in the following order: node_{i%n}, node_{(i+1)%n}, node_{(i+2)%n}, and so on. In this comparison we suppose that we have 2 nodes with the same configuration (24 waiting CPUs and 90 GB of memory), and 3 containers with the following configurations:

– Container 1: 16 CPUs and 60 GB of memory;
– Container 2: 8 CPUs and 30 GB of memory;
– Container 3: 24 CPUs and 90 GB of memory.

In Fig. 2, container 1 is executed on nodea and container 2 on nodeb. When container 3 arrives, it cannot be executed under the Spread strategy, because no node has enough resources left. In Fig. 3, however, the first container is executed on nodea, and the PROMETHEE and Kung strategies then also select nodea to execute container 2; when container 3 arrives, it is directly executed on nodeb.
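This two-node scenario can be replayed with a small simulation (a hedged sketch: Spread picks the node with the fewest containers among those with enough resources, and a simple best-fit rule stands in for the multi-objective strategies, which make the same choices on this example; node and container sizes are taken from the text above):

```python
# Toy replay of the Fig. 2 / Fig. 3 comparison (illustrative only).
# Each node is (free_cpus, free_mem_gb, container_count); "Best fit" is a
# stand-in playing the role of the PROMETHEE/Kung strategies here.

def place(nodes, demand, pick):
    """Place one container, or return None if no node can host it."""
    cpus, mem = demand
    feasible = [n for n in nodes if nodes[n][0] >= cpus and nodes[n][1] >= mem]
    if not feasible:
        return None
    chosen = pick(feasible, nodes)
    c, m, k = nodes[chosen]
    nodes[chosen] = (c - cpus, m - mem, k + 1)
    return chosen

# Spread: fewest containers first; Best fit: tightest remaining node first.
spread = lambda feasible, nodes: min(feasible, key=lambda n: nodes[n][2])
best_fit = lambda feasible, nodes: min(feasible, key=lambda n: nodes[n][:2])

containers = [(16, 60), (8, 30), (24, 90)]
results = {}
for name, pick in [("Spread", spread), ("Best fit", best_fit)]:
    nodes = {"nodea": (24, 90, 0), "nodeb": (24, 90, 0)}
    results[name] = [place(nodes, c, pick) for c in containers]
    print(name, results[name])
# Spread   ['nodea', 'nodeb', None]     -> container 3 cannot be placed
# Best fit ['nodea', 'nodea', 'nodeb']  -> all three containers run
```

The simulation reproduces the outcome of Figs. 2 and 3: Spread scatters the first two containers and strands container 3, while the consolidating strategy leaves nodeb free for it.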

5 Experimental Evaluation

In this section we report experiments with our multi-objective scheduling strategies implemented in Docker SwarmKit, to check whether the implementation meets our expectations. The experiments were run on the Grid'5000 platform [8], an experimental large-scale testbed for distributed computing in France. We reserved an infrastructure with a total of 128 computing cores distributed over 4 nodes (Intel Xeon CPUs); each node has 32 cores and 130 GB of memory. The evaluation is based on the submission of 18 containers, each with an execution time of 3 minutes. Each container is submitted by one of three users, each with a particular container configuration:

– User 1: each of his containers needs 30 CPUs and 120 GB of memory;
– User 2: each of his containers needs 20 CPUs and 80 GB of memory;
– User 3: each of his containers needs 10 CPUs and 40 GB of memory.

We evaluate our multi-objective scheduling strategies under two submission modes: (i) containers submitted at the same time, i.e., each user submits 6 containers simultaneously; and (ii) containers submitted online with a fixed frequency of 1 minute, i.e., every minute, 3 containers are submitted by the 3 different users. The first mode stresses the scheduling system; the second represents a "normal" operating mode.

5.1 Distribution of Containers in Different Nodes

In this subsection we present the distribution of containers over our 4 nodes according to the submission mode and the three scheduling strategies: (i) Spread; (ii) PROMETHEE; and (iii) Kung. Containers Submitted at the Same Time: Figures 4, 5 and 6 show the distribution of containers submitted at the same time over the 4 nodes using the Spread strategy (Fig. 4), the PROMETHEE strategy (Fig. 5), and the Kung strategy (Fig. 6).

Fig. 4. Distribution of containers submitted at the same time in 4 nodes using Spread strategy


Fig. 5. Distribution of containers submitted at the same time in 4 nodes using PROMETHEE strategy

Fig. 6. Distribution of containers submitted at the same time in 4 nodes using Kung strategy

With the PROMETHEE and Kung strategies, the load of containers on each node is higher than with the Spread strategy. This is a good property of our implementation, as expected.

Fig. 7. Distribution of containers submitted online in 4 nodes using Spread strategy

Containers Submitted Online: Figures 7, 8 and 9 show the distribution of containers submitted online over the 4 nodes using the Spread strategy (Fig. 7), the PROMETHEE strategy (Fig. 8), and the Kung strategy (Fig. 9). We make the same observation as in the previous experiment, i.e., the load of containers with PROMETHEE and Kung is higher than with the Spread strategy.

5.2 Comparison of Performance

In this subsection we compare the performance of the 3 scheduling strategies (Spread, PROMETHEE, and Kung) on the 4 nodes according to the submission mode. Table 4 compares the running times of the Spread, PROMETHEE, and Kung scheduling strategies for each submission mode. The running time obtained with the Spread strategy is always the longest, while the running times of the PROMETHEE and Kung strategies are almost the same.


Fig. 8. Distribution of containers submitted online in 4 nodes using PROMETHEE strategy


Fig. 9. Distribution of containers submitted online in 4 nodes using Kung strategy

Table 4. Comparison of performance between the 3 scheduling strategies

Scheduling strategy   At the same time   Online
Spread                747.49 (s)         747.29 (s)
PROMETHEE             558.62 (s)         622.26 (s)
Kung                  559.11 (s)         621.46 (s)

We also note that sometimes the PROMETHEE running time is better (submission at the same time) and sometimes the Kung running time is better (online submission).

6 Conclusion

We have presented, in this paper, new multi-objective scheduling strategies for Docker SwarmKit. Our scheduling strategies are based on the PROMETHEE and Kung multi-objective algorithms. Their principle is to select, from a set of nodes, the node that will execute a container by taking multiple criteria into consideration: (i) the number of waiting CPUs and (ii) the free memory size. The goal is to execute a container on the node which offers a good compromise between the availability of CPU cores and the free memory size. Currently, Docker SwarmKit uses a simple FIFO (First In First Out) strategy to select, from a queue of containers, the first container to be executed. As a perspective, we propose to apply the same multi-objective principle to the selection of the first container to execute from the queue. We previously presented in [16] a scheduling and resource management system based on an economic model; to choose the node that must execute a request, the system presented in [16] uses a Bin Packing strategy. As another perspective, we propose to use our new multi-objective scheduling


strategies in the system proposed in [16] and to compare the performance of the Bin Packing and multi-objective scheduling strategies.

Acknowledgments. This work is funded by the French Fonds Unique Interministériel (FUI) Wolphin project. We thank the Grid'5000 team for their help in using the testbed.

References

1. Behzadian, M., Kazemzadeh, R., Albadvi, A., Aghdasi, M.: PROMETHEE: a comprehensive literature review on methodologies and applications. Eur. J. Oper. Res. 200(1), 198–215 (2010)
2. Cáceres, L.P., Pagnozzi, F., Franzin, A., Stützle, T.: Automatic configuration of GCC using irace. In: Lutton, E., Legrand, P., Parrend, P., Monmarché, N., Schoenauer, M. (eds.) EA 2017. LNCS, vol. 10764, pp. 202–216. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78133-4_15
3. Cérin, C., Ben-Abdaallah, W., Saad, W., Menouer, T.: A new Docker Swarm scheduling strategy. In: 7th International Symposium on Cloud and Service Computing, Kanazawa, Japan (2017)
4. Chang, P.-C., Chen, S.-H.: The development of a sub-population genetic algorithm II (SPGA II) for multi-objective combinatorial problems. Appl. Soft Comput. 9(1), 173–181 (2009)
5. Choi, S., Myung, R., Choi, H., Chung, K., Gil, J., Yu, H.: GPSF: general-purpose scheduling framework for container based on cloud environment. In: IEEE iThings and IEEE GreenCom and IEEE CPSCom and IEEE SmartData (2016)
6. Daolio, F., Liefooghe, A., Vérel, S., Aguirre, H.E., Tanaka, K.: Problem features versus algorithm performance on rugged multiobjective combinatorial fitness landscapes. Evol. Comput. 25(4), 555–585 (2017)
7. Ding, L., Zeng, S., Kang, L.: A fast algorithm on finding the non-dominated set in multi-objective optimization. In: The 2003 Congress on Evolutionary Computation, CEC 2003, vol. 4, pp. 2565–2571, December 2003
8. Grid'5000: https://www.grid5000.fr/
9. Grillet, A.: Comparison of containers schedulers. Medium (2016)
10. Brans, J.-P., Mareschal, B.: PROMETHEE methods. In: Multiple Criteria Decision Analysis: State of the Art Surveys. International Series in Operations Research & Management Science, vol. 78 (2005)
11. Jimenez, L.L., Simon, M.G., Schelén, O., Kristiansson, J., Synnes, K., Åhlund, C.: CoMA: resource monitoring of Docker containers. In: Proceedings of the 5th International Conference on Cloud Computing and Services Science (CLOSER 2015) (2015)
12. Knowles, J.D., Corne, D.W.: M-PAES: a memetic algorithm for multiobjective optimization. In: Proceedings of the 2000 Congress on Evolutionary Computation, CEC00 (Cat. No. 00TH8512), vol. 1, pp. 325–332 (2000)
13. Kung, H.T., Luccio, F., Preparata, F.P.: On finding the maxima of a set of vectors. J. ACM 22(4), 469–476 (1975)
14. Li, K., Fialho, Á., Kwong, S., Zhang, Q.: Adaptive operator selection with bandits for a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 18(1), 114–130 (2014)
15. Medel, V., Tolón, C., Arronategui, U., Tolosana-Calasanz, R., Bañares, J.Á., Rana, O.F.: Client-side scheduling based on application characterization on Kubernetes. In: Pham, C., Altmann, J., Bañares, J.Á. (eds.) GECON 2017. LNCS, vol. 10537, pp. 162–176. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68066-8_13
16. Menouer, T., Cerin, C.: Scheduling and resource management allocation system combined with an economic model. In: The 15th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2017) (2017)
17. Peinl, R., Holzschuher, F., Pfitzer, F.: Docker cluster management for the cloud - survey results and own solution. J. Grid Comput. 14(2), 265–282 (2016)
18. Deshmukh, S.C.: Preference ranking organization method of enrichment evaluation (PROMETHEE). Int. J. Eng. Sci. Inven. 2, 28–34 (2013)
19. Taillandier, P., Stinckwich, S.: Using the PROMETHEE multi-criteria decision making method to define new exploration strategies for rescue robots. In: International Symposium on Safety, Security, and Rescue Robotics (2011)
20. Ullman, J.: NP-complete scheduling problems. J. Comput. Syst. Sci. 10(3), 384–393 (1975)
21. Xing, L.-N., Chen, Y.-W., Yang, K.-W.: Multi-objective flexible job shop schedule: design and evaluation by simulation modeling. Appl. Soft Comput. 9(1), 362–376 (2009)
22. Zhou, A., Qu, B.-Y., Li, H., Zhao, S.-Z., Suganthan, P.N., Zhang, Q.: Multiobjective evolutionary algorithms: a survey of the state of the art. Swarm Evol. Comput. 1(1), 32–49 (2011)
23. The Apache Software Foundation: Apache Mesos. http://mesos.apache.org/
24. Kubernetes scheduler. https://kubernetes.io/
25. SwarmKit. https://github.com/docker/swarmkit/

Internet Performance Prediction Framework Based on PingER Dataset

Wei Zhang, Xiaofei Xing(✉), Saqib Ali, and Guojun Wang

School of Computer Science and Technology, Guangzhou University, Guangzhou 510006, People's Republic of China
[email protected]

Abstract. Internet performance directly affects the scalability, reliability, and availability of online applications. A delay of a few milliseconds may cause companies to lose millions of dollars. Therefore, Internet measurements are carried out to capture the performance of Internet links worldwide. Most Internet performance monitoring frameworks are active in nature, i.e., they can only capture the real-time performance of Internet links. Thus, these monitoring frameworks are unable to forecast the near-future performance of the Internet links in a region. Such estimates are quite critical for network administrators who carry out bandwidth-intensive experiments between different sites, for policy makers who suggest future upgrades to Internet infrastructures, and for streaming service providers who want to enhance the quality of service for their customers. Therefore, we analyze different machine learning algorithms, including Multiple Linear Regression, Random Forest, Gradient Boosting, and eXtreme Gradient Boosting, to predict the performance of Internet links using the PingER (Ping End-to-End Reporting) dataset for countries such as China, India, and Japan. Our experimental results show that Multiple Linear Regression achieves better Internet performance prediction accuracy than the other methods. Our work can be utilized by Internet service providers, streaming service providers, or policymakers for the design, deployment, and evaluation of next-generation Internet infrastructure.

Keywords: Multiple linear regression · Internet performance · Prediction · PingER

1 Introduction

Internet traffic is increasing every day. The Internet is used in a wide variety of applications, including corporate, education, entertainment, news, games, and social networking, all of which demand high end-to-end link performance in terms of scalability, reliability, and performance. A delay of a few hundred milliseconds may cause companies to lose millions of dollars or cost the game industry a large number of users. For example, Singla [1] mentioned that a delay of 100 ms causes Amazon to lose 1% of sales, that a 500-ms delay in search responses results in a 1.2% decrease in Bing's revenue, and that even a 250-ms difference can decide the competition between rivals, so reducing latency improves the user experience. On the other hand, the performance of the Internet is also directly related to a country's key economic development indicators: according to the World Bank, a 10% increase in Internet speed is associated with 1.3% economic growth. Therefore, the performance of the Internet plays an important role in our daily lives [2].

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 118–131, 2018. https://doi.org/10.1007/978-3-030-05057-3_9

Internet performance directly affects the reliability and availability of Internet links. Therefore, Internet measurements are carried out to capture the performance of links worldwide. The key Internet performance metrics include throughput, jitter, delay, packet loss, reachability, and directivity. Many Internet performance monitoring frameworks are described in the literature, for example, SamKnows [3], BISmark [4], Dasu [5], Netradar [6], Portolan [7], RIPE Atlas [8], and perfSONAR [9], which was originally partially based on the PingER architecture [10]. These frameworks use different tools, i.e., ping, mtr, cron, ntp, dig, netstat, iperf, and traceroute, to measure the performance of Internet links in real time. The findings of these frameworks are critical for Internet administrators and managers to fine-tune their infrastructures. Most of the above Internet performance monitoring platforms are active: they only capture the real-time effects of congestion, bottleneck links, queue overflows, and errors on Internet software or hardware [11]. However, these frameworks do not provide any information on the future performance of Internet links. Such predictions are necessary for optimizing resources for bandwidth-intensive experiments conducted between research centers, laboratories, and universities. In addition, they are crucial for Internet managers, content service providers, and policy makers deciding on future upgrades to the Internet infrastructure in a region.
In this paper, we therefore focus on Internet performance prediction. Internet performance is unstable and hard to anticipate, and traditional Internet performance analysis only examines the current performance parameters, producing performance logs that serve as a basis for assessing how the Internet is operating. Instead, we use the historical Internet performance monitoring data from the PingER platform. First, we pre-process the data. Then, we apply machine learning algorithms, namely Multiple Linear Regression, Random Forest, Gradient Boosting, and XGBoost, to build Internet performance prediction models. Finally, we use the Root Mean Square Error (RMSE), the error rate, and other indicators to compare the prediction accuracy of the different models and identify a suitable prediction algorithm for the PingER Internet performance data. The remainder of the paper is organized as follows. Related work is discussed in Sect. 2. Sections 3 and 4 introduce the PingER framework and data. The proposed approach for predicting Internet performance is explained in Sect. 5. Section 6 presents the results and discussion. Finally, Sect. 7 concludes the paper.

2 Related Work

Internet performance prediction is usually based on an observed sequence, so methods for Internet traffic prediction can also be used for Internet performance prediction. Common prediction methods include Least Squares and regression models such as the Auto-Regressive Moving Average (ARMA), the Autoregressive Integrated Moving Average (ARIMA), the Seasonal Autoregressive Integrated Moving Average (SARIMA), and Time Series Seasonal Analysis (TSSA) [12–17]. With the maturity of machine learning and data mining algorithms and their strong performance in various fields, many researchers have applied data mining methods to Internet traffic prediction in recent years. Zheng [18] proposed an Internet traffic prediction model (ET-SVM) integrating encompassing tests and support vector machines: several single models predict the Internet traffic, the merit of each model is measured by the Root Mean Square Error (RMSE) of its predictions, appropriate single models are selected via the encompassing test, and finally the single-model predictions are combined by a Support Vector Machine to obtain the final network traffic prediction. Chen [19] proposed an Internet traffic prediction model (ELM-LSSVM) that combines an extreme learning machine with a least squares support vector machine to improve the prediction accuracy of Internet traffic. Hammami [20] gave a classification model based on a flow prediction algorithm. Liu [21] applied a Back Propagation (BP) neural network to Internet traffic prediction. Cui [22] replaced the BP neural network with an Elman neural network and achieved better results. The difference from BP is that in an Elman network the output of the hidden layer is fed back to the input layer as the input of the next step; this feedback lets the Elman network better capture the dynamic characteristics of a time sequence and therefore adapt better to time series prediction. In this paper, we use four algorithms to build prediction models for the Internet performance data, judge the quality of the results by the Root Mean Square Error (RMSE), and finally select the optimal model for Internet performance prediction.

2.1 Random Forest Algorithm

The random forest algorithm was first proposed by Breiman and Cutler in 2001 [23]. It uses data sampled from the training set to construct each basic model, and it also samples the attributes, extracting a subset of them as the input of each basic model. To reduce the generalization error, the random forest algorithm thus has two layers of sampling: one layer is attribute sampling and the other is training-set sampling. In this paper, the algorithm operates on a training set of 364 samples of one attribute and trains K decision trees; the result with the most votes is used as the output of the random forest.

2.2 Gradient Boosting Algorithm

Gradient Boosting is a method of implementing Boosting [24, 25]. Its main idea is that each new model is built in the direction of the negative gradient of the loss function of the previously established model. The loss function describes how poorly the model fits: the greater the loss, the more error-prone the model. If each step makes the loss function decrease, the model is constantly improving, and the best way to achieve this is to let the loss function descend along its gradient. The advantages of this algorithm include no need for feature normalization, automatic feature selection, model interpretability, and support for multiple loss functions.

2.3 XGBoost Algorithm

XGBoost (eXtreme Gradient Boosting) is a massively parallel boosted-tree system and an extension of the Gradient Boosting algorithm [26]. Under the same conditions, the XGBoost algorithm is more than 10 times faster than similar algorithms [27]. XGBoost can use CPU multi-threading for parallel tree construction and supports platforms such as yet-another-resource-negotiator (YARN) and the message-passing-interface (MPI) to achieve distributed computing, which further improves training speed. Its advantages are efficiency and higher accuracy.
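As a concrete illustration, the four candidate learners described above can be instantiated with the common scikit-learn (and optional xgboost) APIs. The data here is a hypothetical stand-in for five lagged RTT features, not the PingER dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(160, 5, size=(100, 5))                 # toy stand-in for 5 lagged RTTs (ms)
y = X @ np.array([0.6, 0.1, 0.1, 0.1, 0.1]) + rng.normal(0, 1, 100)

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient boosting": GradientBoostingRegressor(random_state=0),
}
try:                                                  # xgboost is a separate package
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(n_estimators=100)
except ImportError:
    pass

# Fit on the first 80 samples, predict the held-out 20.
predictions = {name: m.fit(X[:80], y[:80]).predict(X[80:])
               for name, m in models.items()}
```

Each learner exposes the same fit/predict interface, which is what makes the RMSE comparison in the following sections straightforward.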

3 The PingER Framework

This paper is based on the PingER framework developed at the SLAC National Accelerator Laboratory (Stanford Linear Accelerator Center). Originally, it was designed to facilitate the modern High Energy Nuclear and Particle (HENP) physics data-intensive experiments taking place among SLAC, the Brookhaven National Laboratory (BNL), and the European Center for Particle Physics (CERN). However, for the last fifteen years, the focus of the project has been to measure, store, and analyze the historical end-to-end performance of Internet links worldwide [28–30]. PingER consists of more than 50 Monitoring Agents (MAs) active in 20 countries of the world, as shown in Fig. 1. The PingER measurement cycle is activated by the MAs every half hour. Each MA has a list of remote sites of interest. During each cycle, it sends a set of 100-byte ping requests and 1000-byte ping requests to each target in the MA's remote-site list. The initial 100-byte ping is normally discarded, as it is used to prime the routing caches. The cycle for each remote site stops when the MA receives 10 ping responses or has issued 30 ping requests. The raw data collected for each set of pings consists of the MA name and IP address, followed by the target remote-site name and IP address, the payload, time stamp, packets sent, packets received, minimum Round Trip Time (RTT), maximum RTT, and average RTT, followed by the sequence numbers of the received packets and their actual RTTs. The data is publicly available through a web server (at each monitoring site) running a Common Gateway Interface (CGI) program. The main host at SLAC works as a central data storage repository. It fetches all the raw data collected by each MA and stores it in a database on a daily basis.
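The per-site measurement loop just described (discard the priming ping, then stop after 10 responses or 30 requests) can be sketched as follows. `send_ping` is a hypothetical stand-in for the real ping utility, returning an RTT in milliseconds or None on packet loss:

```python
import random

def send_ping(target, payload_bytes):
    """Hypothetical stand-in: return an RTT in ms, or None on packet loss."""
    return random.uniform(150, 200) if random.random() > 0.2 else None

def measure_site(target, payload_bytes=100, wanted=10, max_requests=30):
    send_ping(target, payload_bytes)          # priming ping; result discarded
    rtts, sent = [], 0
    while len(rtts) < wanted and sent < max_requests:
        rtt = send_ping(target, payload_bytes)
        sent += 1
        if rtt is not None:
            rtts.append(rtt)
    # Summary fields mirror the raw-record layout described in the text.
    return {"target": target, "sent": sent, "received": len(rtts),
            "min_rtt": min(rtts, default=None), "max_rtt": max(rtts, default=None),
            "avg_rtt": sum(rtts) / len(rtts) if rtts else None}

summary = measure_site("CN.EDU.N1")
```

The same loop would be repeated for the 1000-byte payload and for every site in the MA's remote-site list.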
The data is analyzed to extract sixteen different Internet performance metrics, e.g., round-trip time (average, maximum, and minimum), packet loss, jitter, unreachability, throughput, directivity, unpredictability, and quiescence for each day, month, and year. Further, daily, monthly, and yearly summary reports are compiled for each MA and remote-site pair. Currently, PingER's data warehouse holds about 60 GB of Internet performance data, stored in more than 100,000 text files at a compression ratio of 5:1.

122

W. Zhang et al.

Fig. 1. PingER’s MA and remote sites around the world (the red dot in the ﬁgure represents the monitoring agents and green represents the remote site). (Color ﬁgure online)

4 The PingER Data

As mentioned before, PingER has a long history and a large amount of data in the field of network performance monitoring, but there is currently no prediction service in the PingER framework. The main advantages of choosing PingER data are as follows. There is plenty of historical data, and it is easy to use; such historical data has a very important influence on the prediction of Internet performance. PingER has been operating since 1995 and has continuously monitored the Internet performance of over 700 sites [31]. The historical data from any monitoring host to any monitored site can be viewed at hourly, daily, monthly, or yearly granularity, and the data is compressed into files named after the performance metric. At the same time, all Internet performance data is displayed on the PingER visual web page, so users can easily download data and conduct experimental analysis [32]. Furthermore, the PingER monitoring framework has retained a large number of users since its adoption and helps the development of the Internet in various regions. However, the performance of the Internet is not predicted within the PingER platform. Therefore, performance prediction for the PingER data is extremely meaningful.

4.1 Data Sources

The number of Internet users in Asia is increasing, and the total number of Internet users in China, India, and Japan accounts for 66% of all Asian Internet users, as shown in Table 1 [33]. Therefore, measuring and predicting the Internet links of these three countries is extremely important for understanding the Internet conditions in Asia as a whole. This paper uses the average round-trip time as the experimental basis. The three selected links are shown in Table 2.

Internet Performance Prediction Framework Based on PingER Dataset

123

Table 1. Asia Internet user and population data

Asia   | Population    | Internet users | Penetration (% Population) | Users (% Asia)
China  | 1,415,045,928 | 772,000,000    | 54.6%                      | 38.1%
India  | 1,354,051,854 | 462,124,989    | 34.1%                      | 22.8%
Japan  | 127,185,332   | 118,626,672    | 93.3%                      | 5.9%

Table 2. Data sources

Monitoring-site          | Remote-site    | Country
EDU.SLAC.STANFORD.PINGER | CN.EDU.N1      | China
EDU.SLAC.STANFORD.PINGER | IN.MITPUNE.WWW | India
EDU.SLAC.STANFORD.PINGER | JP.U-TOKYO.AC  | Japan

4.2 Data Pre-processing

The downloaded files are in tab-separated (.tsv) format, with a total of 1095 records for the three links. Each record contains the name of the Internet performance metric, the monitoring host, the remote site, the date, and other related information. Missing values in the data are marked with a dot (.). We first convert the source file to a comma-separated (.csv) file and then replace each missing value with the average of its link. After the replacement is complete, the resulting data distribution is shown in Fig. 2.

Fig. 2. The raw data distribution of China, India and Japan
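A minimal version of the pre-processing step described in Sect. 4.2, using only the Python standard library. The file content is a hypothetical three-record excerpt; missing values marked "." are replaced by the link's mean before writing comma-separated output:

```python
import csv, io, statistics

# Toy stand-in for one link's downloaded .tsv: date and average RTT, '.' = missing.
tsv_text = "date\tavg_rtt\n2018/4/23\t163.727\n2018/4/24\t.\n2018/4/25\t170.055\n"

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
link_mean = statistics.mean(float(r["avg_rtt"]) for r in rows if r["avg_rtt"] != ".")
for r in rows:
    if r["avg_rtt"] == ".":
        r["avg_rtt"] = f"{link_mean:.3f}"     # impute with the link average

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "avg_rtt"])
writer.writeheader()
writer.writerows(rows)                         # comma-separated (.csv) output
```

In practice the same two passes (compute the per-link mean, then fill gaps) would run over all 1095 records of the three links.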

5 Proposed Approach

In this paper, the average round-trip time in the PingER monitoring framework is selected as the basic metric for Internet performance prediction for three countries, i.e., China, India, and Japan. The prediction proceeds through data collection, handling of missing values, feature selection, algorithm selection, building of the prediction model, prediction, and model evaluation; the overall process is shown in Fig. 3.

124

W. Zhang et al.

[Fig. 3 shows the workflow: Start → Collect Historical Data from PingER → Data Pre-processing → Feature Selection → {Linear Regression, Random Forest, Gradient Boosting, XGBoost} → Data Prediction → Comparison of Root Mean Square Errors → The Model with the Lowest Root Mean Square Error → Predicted Value → End.]

Fig. 3. Predicting process
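The workflow of Fig. 3 reduces to: fit every candidate model on the training set, score it by RMSE on the held-out set, and keep the best. A model-agnostic sketch, where the entries of `fit_fns` are hypothetical placeholders for the four learners:

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def select_model(fit_fns, x_train, y_train, x_test, y_test):
    """Fit every candidate, score by RMSE on the held-out set, return the best."""
    scores = {}
    for name, fit in fit_fns.items():
        predict = fit(x_train, y_train)        # fit() returns a prediction function
        scores[name] = rmse(y_test, [predict(x) for x in x_test])
    return min(scores, key=scores.get), scores

# Two toy candidates: repeat the last lagged value vs. predict the training mean.
fit_fns = {
    "persistence": lambda X, y: (lambda x: x[-1]),
    "mean": lambda X, y: (lambda x: sum(y) / len(y)),
}
x_tr, y_tr = [[160], [165]], [161, 166]
x_te, y_te = [[170]], [171]
best, scores = select_model(fit_fns, x_tr, y_tr, x_te, y_te)
```

In the paper's setting, `fit_fns` would wrap the multiple linear regression, random forest, gradient boosting, and XGBoost learners, with the 338/27 train/test split described in Sect. 6.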

5.1 Select the Characteristic Variables

Features are constructed manually from the original data set. Data analysis and visualization of the average round-trip time show that the current day's Internet performance is correlated with that of the previous days. After trying different numbers of features, we found that the prediction results were optimal when five features were selected. That is, the average round-trip times of the five preceding days serve as the independent variables x1, x2, x3, x4, and x5, and the average round-trip time of the current day serves as the dependent variable Y. For example, with May 19, 2018 providing the dependent variable, May 14, 15, 16, 17, and 18 provide the independent variables.

5.2 Establish a Multiple Linear Regression Model

In this paper, we use multiple linear regression, random forest, gradient boosting, and XGBoost to build the training models. We input the training data into each model and train it; after training, we build the prediction model and then input the test data set into the model to predict the results. Finally, the Root Mean Square Error (RMSE) is used to evaluate the results. The basic task of multiple linear regression analysis is to establish, from the actual observations, a multiple linear regression equation relating the dependent variable to multiple independent


variables; to test and analyze the significance of the linear effect of each independent variable on the dependent variable; to choose the independent variables that have a significant linear influence on the dependent variable; to establish the optimal multiple linear regression equation; and to evaluate the relative importance of each independent variable to the dependent variable [34, 35]. Studying the relationship between the variation of two or more independent variables and one dependent variable under the condition of linear correlation is called multiple linear regression analysis, and the resulting mathematical equation is a multiple linear regression model. The multiple linear regression model is an extension of the one-dimensional linear regression model. Let the dependent variable y and the independent variables x_1, x_2, ..., x_{m-1} have n groups of actual observations; y is an observable random variable subject to the m-1 non-random factors x_1, x_2, ..., x_{m-1} and a random factor ε. Suppose y and x_1, x_2, ..., x_{m-1} have the linear relationship shown in Eq. (1):

    y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_{m-1} x_{m-1} + ε    (1)

where y is the dependent variable; x_1, x_2, ..., x_{m-1} are the independent variables; β_0, β_1, ..., β_{m-1} are m unknown parameters; and ε is an unobservable random error term with mean 0 and variance σ² > 0, generally assumed to satisfy ε ~ N(0, σ²). For n (n ≥ p) independent observations, n sets of data samples are obtained, as shown in Eq. (2):

    y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_{m-1} x_{i,m-1} + ε_i,    i = 1, 2, ..., n    (2)

where ε_1, ε_2, ..., ε_n are mutually independent and each follows the distribution N(0, σ²). To facilitate the mathematical treatment, Eq. (2) is written in matrix form, as shown in Eq. (3):

    Y = Xβ + ε,    ε ~ N(0, σ² I_n)    (3)
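Under the matrix form of Eq. (3), the least-squares estimate solves min ||Y − Xβ||². A numpy sketch with a toy design matrix (an intercept column plus five lagged RTT values, mirroring the paper's setup; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
lags = rng.normal(165, 5, size=(50, 5))            # five previous-day RTTs per row
X = np.column_stack([np.ones(50), lags])           # prepend the intercept column
beta_true = np.array([30.0, 0.6, 0.1, -0.1, 0.1, 0.1])
Y = X @ beta_true + rng.normal(0, 0.001, 50)       # tiny noise term epsilon

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solution of Y = X beta
```

With almost no noise the estimate recovers the generating coefficients; with real PingER data the fitted coefficients are those reported in Table 3.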

5.3 Parameter Calculation

The parameters β_0, β_1, ..., β_{m-1} in the regression equation are unknown. When we estimate them with the sample statistics β̂_0, β̂_1, ..., β̂_{m-1}, we obtain the estimated multiple regression equation shown in Eq. (4):

    ŷ = β̂_0 + β̂_1 x_1 + β̂_2 x_2 + ... + β̂_{m-1} x_{m-1}    (4)

The least squares method is then used to obtain the values of β̂_0, β̂_1, ..., β̂_{m-1}; that is, the sum of squared residuals is minimized to solve for the parameters of the regression equation, shown in Table 3.

Table 3. The parameters of the regression equation

      | β̂0     | β̂1    | β̂2     | β̂3     | β̂4      | β̂5
China | 31.9188 | 0.6263 | 0.069   | −0.0823 | 0.08785  | 0.1135
India | 43.7882 | 0.8644 | −0.0965 | 0.0933  | −0.02727 | −0.0023
Japan | 5.2331  | 0.813  | 0.0013  | 0.07288 | −0.1283  | 0.2002

Therefore, the corresponding regression equations for the three links are as shown in Eqs. (5)–(7):

    China: Y = 31.9188 + 0.6263 x1 + 0.069 x2 − 0.0823 x3 + 0.08785 x4 + 0.1135 x5    (5)
    India: Y = 43.7882 + 0.8644 x1 − 0.0965 x2 + 0.0933 x3 − 0.02727 x4 − 0.0023 x5    (6)
    Japan: Y = 5.2331 + 0.813 x1 + 0.0013 x2 + 0.07288 x3 − 0.1283 x4 + 0.2002 x5    (7)

Correlation Coefficient Check. The Root Mean Square Error (RMSE) is used in this paper to assess the quality of the prediction results. RMSE is the square root of the mean squared error between the predicted and real values [36]; it is a quantitative trade-off measure, as shown in Eq. (8):

    RMSE = sqrt( (1/N) Σ_{m=1}^{N} (x_m − x̂_m)² )    (8)

where x_m is the real value and x̂_m is the predicted value. Obviously, the smaller the RMSE, the better the prediction. As can be seen from Table 4, the multiple linear regression model is superior to the other three algorithms on the Internet performance data. Therefore, this paper uses multiple linear regression to predict Internet performance.
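As a sanity check, the fitted equation for the China link (Eq. (5)) can be evaluated directly. The input values here are an arbitrary illustration, not data from the paper:

```python
def predict_china(x1, x2, x3, x4, x5):
    """Fitted multiple linear regression for the CN.EDU.N1 link, Eq. (5)."""
    return (31.9188 + 0.6263 * x1 + 0.069 * x2 - 0.0823 * x3
            + 0.08785 * x4 + 0.1135 * x5)

# Five (hypothetical) preceding-day average RTTs, all 160 ms.
y_hat = predict_china(160, 160, 160, 160, 160)
```

Because the lag coefficients sum to about 0.814, the model pulls the prediction toward the recent level of the series plus the 31.92 ms intercept.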


Table 4. RMSE values for the four algorithms

                 | Linear regression | Random forest | Gradient boosting | XGBoost
JP.U-TOKYO.AC.N1 | 0.4016            | 0.627         | 0.6608            | 25.4598
IN.MITPUNE.WWW   | 1.81101           | 3.1409        | 5.3875            | 63.0183
CN.EDU.N1        | 2.5147            | 6.6665        | 6.9849            | 35.2993
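The selection implied by Table 4 (take, per link, the model with the smallest RMSE) can be checked directly; the values below are copied from the table:

```python
rmse_table = {
    "JP.U-TOKYO.AC.N1": {"Linear regression": 0.4016, "Random forest": 0.627,
                         "Gradient boosting": 0.6608, "XGBoost": 25.4598},
    "IN.MITPUNE.WWW":   {"Linear regression": 1.81101, "Random forest": 3.1409,
                         "Gradient boosting": 5.3875, "XGBoost": 63.0183},
    "CN.EDU.N1":        {"Linear regression": 2.5147, "Random forest": 6.6665,
                         "Gradient boosting": 6.9849, "XGBoost": 35.2993},
}
# Pick the minimum-RMSE model for each link.
best_per_link = {link: min(scores, key=scores.get)
                 for link, scores in rmse_table.items()}
```

Linear regression wins on every link, which is why it is carried forward to Sect. 6.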

6 Results and Discussion

According to the above theory and methods, we finally choose the multiple linear regression model to predict Internet performance. The average RTT values of the three links from China, India, and Japan are selected, with a total of 365 data points per link. The first 338 data points are used as the training set and fed into the regression model to obtain the prediction regression equation; the last 27 data points are used as test data. The prediction results for one of the links are shown in Table 5.

Table 5. Predicted and real values

Date      | Real value | Predicted value | Error rate
2018/4/23 | 163.727    | 166.200         | 1.51%
2018/4/24 | 163.216    | 166.748         | 2.16%
2018/4/25 | 170.055    | 170.020         | −0.02%
2018/4/26 | 165.683    | 166.935         | 0.76%
2018/4/27 | 168.441    | 168.969         | 0.31%
2018/4/28 | 168.379    | 168.875         | 0.29%
2018/4/29 | 163.766    | 166.437         | 1.63%
2018/4/30 | 163.755    | 166.264         | 1.53%
2018/5/1  | 164.781    | 167.080         | 1.40%
2018/5/2  | 163.775    | 166.238         | 1.50%
2018/5/3  | 166.923    | 167.928         | 0.60%
2018/5/4  | 183.442    | 178.659         | −2.61%
2018/5/5  | 169.242    | 168.890         | −0.21%
2018/5/6  | 164.638    | 166.801         | 1.31%
2018/5/7  | 167.415    | 169.225         | 1.08%
2018/5/8  | 164.198    | 168.036         | 2.34%
2018/5/9  | 163.873    | 166.524         | 1.62%
2018/5/10 | 164.964    | 166.817         | 1.12%
2018/5/11 | 166.781    | 168.129         | 0.81%
2018/5/12 | 164.964    | 166.583         | 0.98%
2018/5/13 | 163.984    | 166.054         | 1.26%
2018/5/14 | 164.408    | 166.480         | 1.26%
2018/5/15 | 164.017    | 166.301         | 1.39%
2018/5/16 | 163.838    | 166.006         | 1.32%
2018/5/17 | 168.782    | 169.167         | 0.23%
2018/5/18 | 187.569    | 181.341         | −3.32%
2018/5/19 | 174.982    | 172.483         | −1.43%
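The error-rate column of Table 5 is the signed relative error (predicted − real)/real, expressed as a percentage; recomputing two of the rows reproduces the tabulated values:

```python
def error_rate(real, predicted):
    """Signed relative error, as a percentage of the real value."""
    return (predicted - real) / real * 100

row_apr23 = error_rate(163.727, 166.200)   # table reports 1.51%
row_may04 = error_rate(183.442, 178.659)   # table reports -2.61%
```

Positive rates mark over-prediction; the largest errors occur on the RTT spikes (2018/5/4 and 2018/5/18), where the lag-based model lags behind the sudden jump.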

Using the obtained multiple linear regression equations, Internet performance prediction and analysis are realized. The real and predicted values of the three links are shown in Figs. 4, 5, and 6. With multiple linear regression, the average errors between the predicted and real values of Internet performance are 0.59%, 0.54%, and 0.12%, respectively, so the prediction accuracy is high. Therefore, the model can be used to predict Internet performance.

[Figures 4–6 plot the average round-trip time (ms) of the real and predicted values over the test period for each link.]

Fig. 4. Comparison of predicted and real values of CN.EDU.N1

Fig. 5. Comparison of predicted and real values of JP.U-TOKYO.AC.N1

Fig. 6. Comparison of predicted and real values of IN.MITPUNE.WWW


7 Conclusion

This paper predicts the performance of Internet links based on data collected through the PingER end-to-end Internet monitoring framework. The main performance indicator is the average round-trip time over 365 days from the SLAC monitoring host in the USA to target sites in China, India, and Japan. As a first step, we pre-process the data, mainly by replacing missing values with average values, converting the file format, and extracting key features. Afterward, the data set is divided into two parts, with the first 338 days of data as the training set and the last 27 days as the test set. We then use Multiple Linear Regression, Random Forest, Gradient Boosting, and XGBoost to establish data prediction models and use RMSE to evaluate them. In the end, we found that Multiple Linear Regression gives the best results when predicting the Internet performance of the link data. Therefore, it is effective to predict the performance of Internet links using a Multiple Linear Regression model. It can help network administrators, policy makers, and network service providers to effectively leverage existing Internet infrastructure, and it will help them design high-performance next-generation Internet infrastructure.

Acknowledgments. This work is supported in part by the CERNET Innovation Project under Grant No. NGII20170102, the Natural Science Foundation of China under Grants No. 61772007 and 61632009, the Guangdong Natural Science Foundation of China under Grant No. 2016A030313540, and the Guangzhou Science and Technology Program under Grant No. 201707010284.

References

1. Singla, A., Chandrasekaran, B., Godfrey, P.B., Maggs, B.: The Internet at the speed of light. In: Proceedings of the 13th ACM Workshop on Hot Topics in Networks (HotNets-XIII), pp. 1–7 (2014)
2. Ali, S., Cottrell, R.L., Nveed, A.: PingER Malaysia-Internet performance measuring project: a case study (No. SLAC-PUB-16462). SLAC National Accelerator Lab., Menlo Park, CA, United States (2016)
3. SamKnows Homepage. https://www.samknows.com/. Accessed 30 May 2018
4. Sundaresan, S., Burnett, S., Feamster, N., de Donato, W.: BISmark: a testbed for deploying measurements and applications in broadband access networks. In: Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC 2014), pp. 383–394 (2014)
5. Sánchez, M., Otto, J.: Dasu: pushing experiments to the Internet's edge. In: Proceedings of USENIX Association, pp. 487–499 (2013)
6. Sonntag, S., Manner, J., Schulte, L.: Netradar – measuring the wireless world. In: Wireless Network Measurements, pp. 29–34 (2013)
7. Faggiani, A., Gregori, E., Lenzini, L., Luconi, V., Vecchio, A.: Smartphone-based crowdsourcing for network monitoring: opportunities, challenges, and a case study. IEEE Commun. Mag. 52, 106–113 (2014)
8. Bajpai, V., Eravuchira, S.J., Schönwälder, J.: Lessons learned from using the RIPE Atlas platform for measurement research. ACM SIGCOMM Comput. Commun. Rev. 45, 35–42 (2015)


9. Hanemann, A., et al.: perfSONAR: a service oriented architecture for multi-domain network monitoring. In: Benatallah, B., Casati, F., Traverso, P. (eds.) ICSOC 2005. LNCS, vol. 3826, pp. 241–254. Springer, Heidelberg (2005). https://doi.org/10.1007/11596141_19
10. Matthews, W., Cottrell, L.: The PingER project: active Internet performance monitoring for the HENP community. IEEE Commun. Mag. 38, 130–136 (2000)
11. Paxson, V.: End-to-end Internet packet dynamics. IEEE/ACM Trans. Netw. 7, 277–292 (1999)
12. Wu, C.L., Chau, K.W., Li, Y.S.: Methods to improve neural network performance in daily flows prediction. J. Hydrol. 372, 80–93 (2009)
13. Zhou, D., Chen, S., Dong, S.: Network traffic prediction based on ARFIMA model. arXiv preprint arXiv:1302.6324, vol. 9, pp. 106–111 (2013)
14. Shang, P., Li, X., Kamae, S.: Nonlinear analysis of traffic time series at different temporal scales. Phys. Lett. A 357, 314–318 (2006)
15. Nury, A.H., Hasan, K., Alam, M.J.B.: Comparative study of wavelet-ARIMA and wavelet-ANN models for temperature time series data in northeastern Bangladesh. J. King Saud Univ. Sci. 29, 47–61 (2017)
16. Yin, H., Lin, C., Sebastien, B., Li, B., Min, G.: Network traffic prediction based on a new time series model. Int. J. Commun. Syst. 18, 711–729 (2005)
17. Karunasinghe, D.S.K., Liong, S.Y.: Chaotic time series prediction with a global model: artificial neural network. J. Hydrol. 323, 92–105 (2006)
18. Weiyong, Z., Guangli, F.: Network traffic combination forecasting based on encompassing tests and support vector machine. Comput. Eng. Appl. 15, 84–87 (2013)
19. Hongxing, C.: Network traffic prediction based on extreme learning machine and least square support vector machine. Comput. Eng. Appl. 51(24), 73–77 (2015)
20. Hammami, C., Jemili, I., Gazdar, A., Belghith, A.: Hybrid live P2P streaming protocol. Procedia Comput. Sci. 32, 158–165 (2014)
21. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126 (2004)
22. Cui, F.: Study of traffic flow prediction based on BP neural network. In: 2010 2nd International Workshop on Intelligent Systems and Applications, pp. 1–4 (2010)
23. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
24. Guelman, L.: Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert Syst. Appl. 39, 3659–3667 (2012)
25. Parker, C., Fern, A., Tadepalli, P.: Gradient boosting for sequence alignment. In: Proceedings of the National Conference on Artificial Intelligence, vol. 21, no. 1, p. 452. AAAI Press, Menlo Park (2006)
26. Chen, T., He, T.: XGBoost: eXtreme Gradient Boosting. R package version 0.4-2, pp. 1–4 (2015)
27. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2016)
28. Ali, S., Wang, G., Cottrell, R.L., Masood, S.: Internet performance analysis of South Asian countries using end-to-end Internet performance measurements. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications / 2017 IEEE International Conference on Ubiquitous Computing and Communications, pp. 1319–1326 (2017)


29. Ali, S., Wang, G., Cottrell, R.L., Anwar, T.: Detecting anomalies from end-to-end Internet performance measurements (PingER) using cluster based local outlier factor. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications / 2017 IEEE International Conference on Ubiquitous Computing and Communications, pp. 982–989 (2017)
30. Mal, A., Sabitha, A.S., Bansal, A., White, B., Cottrell, L.: Analysis and clustering of PingER network data. In: Proceedings of the 2016 6th International Conference on Cloud System and Big Data Engineering (Confluence), pp. 268–273 (2016)
31. Ali, S., Wang, G., White, B., Cottrell, R.L.: A blockchain-based decentralized data storage and access framework for PingER. In: 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications / 12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), pp. 1303–1308. IEEE (2018)
32. Ali, S., Wang, G., Xing, X., Cottrell, R.L.: Substituting missing values in end-to-end Internet performance measurements using k-nearest neighbors. In: 2018 IEEE 16th International Conference on Dependable, Autonomic and Secure Computing, 16th International Conference on Pervasive Intelligence and Computing, 4th International Conference on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pp. 919–926. IEEE (2018)
33. Internet World Stats Homepage. https://www.Internetworldstats.com/stats3.htm. Accessed 11 June 2018
34. Guo, H., Wang, X., Gao, Z.: Uncertain linear regression model and its application. J. Intell. Manuf. 28, 559–564 (2017)
35. Sun, H., Liu, H., Xiao, H., He, R., Ran, B.: Short term traffic forecasting using the local linear regression model. Transp. Res. Rec., 143–150 (2003)
36. Kumar, S., Gangwar, S.S.: Intuitionistic fuzzy time series: an approach for handling nondeterminism in time series forecasting. IEEE Trans. Fuzzy Syst. 24, 1270–1281 (2016)

MS-RAID: An Energy-Saving Data Layout for CDP

Jingyu Liu1,3, Ziyao Zhang1,3, Lu Liu2(✉), and Xin Chai1,3

1 School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
[email protected]
3 Hebei Province Key Laboratory of Big Data Calculation, Tianjin 300130, China

Abstract. Continuous data protection (CDP) provides an unlimited-granularity recovery point objective (RPO) and a nearly instant recovery time objective (RTO), but its demand on storage-system performance fluctuates greatly: the system needs high storage bandwidth when it is active and low bandwidth when it is inactive. Conventional RAID, which delivers fixed performance, may therefore face a performance bottleneck or waste power. This paper proposes MS-RAID, a multi-level dynamic-mapping storage scheme based on S-RAID. MS-RAID provides grouping strategies at several levels and adapts to the real-time load by changing the number of disks working in parallel: when throughput rises, MS-RAID switches the high-level disk group into operation to avoid a system performance bottleneck; when throughput falls, it switches the low-level disk group into operation to save energy. Experiments show that MS-RAID is a more energy-efficient data layout than S-RAID, saving more energy while improving performance.

Keywords: RAID · CDP · Energy-saving · Storage · Data layout

1 Introduction

In the era of Big Data, data volume grows exponentially [1, 2]. IDC predicts that global data volume will double every two years: by 2020, worldwide storage is expected to reach 44 ZB, and China's total data volume will grow to 8.06 ZB, 18% of the world total [3]. Data affects every aspect of society, from government decisions to corporate operations and personal life, so data reliability receives ever more attention. Continuous Data Protection (CDP) backs up data automatically and preserves every version of the data, allowing restoration to any point in time with fine-grained restorable objects. CDP needs a large amount of storage space, which keeps driving up the scale of data centers. However, CDP places no fixed requirement on the bandwidth of the storage system; its demand is volatile with obvious time-of-day characteristics. When the system is active (usually during the day), high storage bandwidth is required; when it is inactive (usually at night), low bandwidth suffices. Current storage systems do not take this responsiveness to dynamic loads into account.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 132–141, 2018.
https://doi.org/10.1007/978-3-030-05057-3_10


Existing layouts thus cannot adapt well to the load characteristics of a CDP system: when the load increases, system performance bottlenecks may occur; when it decreases, additional energy is consumed. Data centers deploy RAID [4] to give the whole storage system large capacity and high efficiency, but energy consumption grows dramatically as data centers expand and can no longer be ignored. Much research has focused on energy saving in disk storage systems. DRPM [5] (Dynamic Rotations Per Minute) uses multi-speed disks whose rotational speed follows the real-time workload. DPPDL [6] (Dynamic Partial-parallel Data Layout) adjusts disk parallelism dynamically with the system load, which reduces energy consumption to some extent but wastes part of the disk space. Xu et al. [7] proposed SpringFS, an elastic multilevel load-allocation method that redistributes replicas among servers as the workload changes. Studies [8, 9] show that 80% of the total cost of large data centers comes from the energy consumption of disk storage systems. Li et al. [10–12] proposed S-RAID, a data layout for continuous data storage that saves data-center energy; it is effective, but its application has not been optimized.

Based on the throughput characteristics of CDP data, this paper proposes Multiple S-RAID (MS-RAID), an energy-saving data layout with multi-stage mapping of the storage space, to balance energy consumption and performance. The main contributions include:

1. A data layout, MS-RAID, in which the disk array is divided into multilevel storage spaces that provide different access performance and suit the dynamic load features of CDP data access;
2. A Data Increment (DIP) algorithm in MS-RAID that optimizes the number of I/O operations and reduces the write penalty caused by small writes;
3. A parameter-based disk-state control algorithm that assigns each disk a state parameter and adjusts it dynamically according to I/O accesses, thereby controlling the disk state.

Experiments simulating a 32-way video monitoring system show that the layout avoids over-provisioned performance while still meeting high-performance requirements, achieving efficient and energy-saving data transfer.

2 MS-RAID: Multilevel Mapping Data Layout

2.1 MS-RAID Data Layout

RAID5 is a storage solution that balances storage performance, data security and storage cost. S-RAID5 groups the data chunks on each stripe and accesses the disks within a group in parallel, which both lets non-working disks sleep and preserves the performance the system requires. However, S-RAID5 cannot adapt well to CDP's storage characteristics, because CDP has dynamic requirements for the


performance of the system. Building on S-RAID5, MS-RAID5 optimizes for the access characteristics of CDP data and uses a multi-level grouping strategy to meet the dynamic load demand and maximize CDP's energy savings. As shown in Fig. 1, MS-RAID groups the data chunks on the same stripe of the storage system into multiple levels, with a different number of data chunks per group at each level, thereby providing multiple performance levels. Low-level groups, with fewer disks, offer low performance and low energy consumption; high-level groups, with more disks, offer high performance at higher energy cost. When the system is idle and performance requirements are low, the low-level group runs and the high-level groups stand by to save energy. Conversely, when the system is busy and performance requirements are high, the high-level group runs for higher performance and the low-level groups stand by.

[Fig. 1 shows the MS-RAID architecture: an application server connects over TCP/IP to the CDP storage system, whose file system and virtualization manager sit above disks Disk0–Diskn, organized into Group0 (Disk0–Disk2), Group1 (Disk3–Disk5) and Group2 (Disk6–Diskn).]

Fig. 1. Schematic of MS-RAID

Because the parity disk is the performance bottleneck in RAID4, RAID5 is adopted in this paper. An MS-RAID5 with N disks (N ≥ 3) is divided into N stripes. Stripe_i denotes a stripe of the array and Parity_i its parity chunk. X(i, j) denotes the storage chunk at stripe i on disk j, 0 ≤ i, j ≤ N−1, and D(i, j) denotes a data chunk, expressed as formula (1):

    D(i, j) = { X(i, j),      i + j < N − 1
              { X(i, j + 1),  i + j ≥ N − 1                    (1)

Parity_i of the same stripe can be expressed as formula (2):

    Parity_i = X(i, N − 1 − i)                                  (2)


To adapt to the dynamic requirements of CDP, the disk array is grouped into multiple levels: the N−1 data chunks on each stripe are divided into Q groups, with S_q chunks in group q (Q ≥ 2, S_q ≥ 1). The group sizes satisfy formula (3):

    Σ_{q=0}^{Q−1} S_q = N − 1                                   (3)
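As a quick illustration (not from the paper), constraint (3) for a configuration like the 6-disk, two-group layout used later can be checked in a few lines; the function name is our own:

```python
# Check the grouping constraint of formula (3): Q >= 2 groups, each with
# S_q >= 1 chunks, together covering the N-1 data chunks of a stripe.
def valid_grouping(S, N):
    return len(S) >= 2 and all(s >= 1 for s in S) and sum(S) == N - 1

print(valid_grouping([2, 3], 6))   # the 6-disk, two-level layout: True
print(valid_grouping([2, 2], 6))   # covers only 4 of the 5 data chunks: False
```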

Figure 2 shows an MS-RAID5 with 6 disks arranged in two group levels: Grp0 contains Disk0 and Disk1, and Grp1 contains Disk2, Disk3 and Disk4; the parity chunks are distributed evenly across all disks. When the system is idle, Grp1 is put into standby and Grp0 runs, meeting the requirements on its own, with data written to data chunks D0–D7, D20–D27, and so on. When the system is busy, Grp1 runs and Grp0 is switched to standby, with data written to data chunks D8–D19, D28–D39, and so on; running more disks provides higher storage bandwidth.

Fig. 2. The 6-disk two-level MS-RAID5 data layout
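The idle/busy group switching described above can be sketched as follows; the group sizes follow Fig. 2, while the per-disk bandwidth figure and the function name are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch of MS-RAID's load-driven group selection.
GROUP_SIZES = [2, 3]          # S_0 = 2 (Grp0: Disk0-1), S_1 = 3 (Grp1: Disk2-4)
DISK_BANDWIDTH_MBPS = 40      # assumed sequential bandwidth of a single disk

def active_group(required_mbps):
    """Return the lowest-level group whose parallelism meets the demand."""
    for level, size in enumerate(GROUP_SIZES):
        if size * DISK_BANDWIDTH_MBPS >= required_mbps:
            return level
    return len(GROUP_SIZES) - 1   # saturate at the highest-level group

print(active_group(30))   # light load: Grp0 suffices
print(active_group(100))  # heavy load: switch to Grp1
```

The design point is that only one group spins at a time: the others can stand by, which is where the energy saving comes from.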

2.2 Read and Write

MS-RAID calculates parity with the DIP [12] algorithm to avoid the write penalty when writing data into the array. Computing the parity of a stripe no longer requires reading the original data in its data chunks; only the newly written data and the original parity need an XOR operation:

    P = ⊕ Disk_write                                            (4)

where P denotes the parity data and Disk_write the data being written.


When writing to a stripe begins, its parity is initialized as the XOR of the first and second chunks; as shown in Fig. 3, the initialization of Stripe0's parity chunk P0 is:

    P0 = D0 ⊕ D1                                                (5)

Fig. 3. Initialize parity chunk

When data is written to a different chunk of the same stripe, the old data need not be read back as in RAID or S-RAID; only the old parity is read. The new parity P′0 (shown in Fig. 4) is calculated as formula (6):

    P′0 = D8 ⊕ D9 ⊕ P0                                          (6)

Fig. 4. Calculate new parity data

The DIP algorithm with its pre-read strategy not only avoids waking up inactive disks but also effectively reduces the write penalty of Read-Modify-Write.

2.3 Disk Scheduling

The purpose of disk scheduling is to let the multilevel-grouped RAID adapt to the dynamic requirements of CDP systems and save energy. Because the mapping between a logical block address (LBA) and its physical block address (PBA) is unknown before a data chunk is written, a mapping table between them must be created and updated in time as data chunks are written.


Given a logical address blkno, the base address of its group Grp_p is calculated as formula (7):

    LBA(Grp_p) = { 0,                       p = 0
                 { N · Σ_{j=0}^{p−1} S_j,   p > 0               (7)

The location of the data chunk at logical address blkno is calculated as formula (8):

    f(blkno) = D(f_{stripe_{i,v}}(blkno), f_{D_j}(blkno))       (8)

The physical location of stripe_i is calculated as formula (9):

    f_{stripe_i}(blkno) = ⌊(blkno − LBA(Grp_p)) / (m · S_p)⌋    (9)

The physical location of sub-stripe v within stripe group stripe_i is calculated as formula (10):

    f_{stripe_{i,v}}(blkno) = ⌊(blkno − S_p · G · m) mod S_p⌋   (10)

The disk number of the data chunk is calculated as formula (11):

    f_{D_j}(blkno) = [(blkno − LBA(Grp_p)) mod S_p + Σ_{k=0}^{p−1} S_k]    (11)

MS-RAID schedules disks according to the system's current requirements. To locate the data address quickly when disks are started, a write-address pointer P_LBA records the last chunk written in each group, reducing the system's addressing delay: when a new group is started up, new data is written to the location following the pointer.
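A hedged sketch of part of this addressing scheme, covering formula (7) plus a group lookup and formula (9); S is the group-size profile, N the disk count, and m (sub-stripes per stripe group) follows the paper's notation but its value here is an assumption:

```python
# Sketch of the group addressing of Sect. 2.3 (formulas (7) and (9)).
def group_base_lba(p, S, N):
    """LBA(Grp_p): first logical block served by group p (formula (7))."""
    return 0 if p == 0 else N * sum(S[:p])

def group_of(blkno, S, N):
    """Group p whose logical address range contains blkno."""
    for p in range(len(S) - 1, -1, -1):
        if blkno >= group_base_lba(p, S, N):
            return p

def stripe_in_group(blkno, p, S, N, m=1):
    """f_stripe_i: stripe index inside group p (formula (9)), assuming m = 1."""
    return (blkno - group_base_lba(p, S, N)) // (m * S[p])

S, N = [2, 3], 6                      # the 6-disk two-level layout of Fig. 2
print(group_base_lba(1, S, N))        # 12: Grp1 starts after N * S_0 blocks
print(group_of(5, S, N), group_of(20, S, N))
```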

3 Experiment and Analysis

This section evaluates the performance and energy savings of MS-RAID on a CDP-based video-monitoring data storage system.

3.1 Performance Testing

MS-RAID chunks can be adjusted to the specific requirements of the storage system. The monitoring environment needs at least 4.6 TB of data storage, plus an additional 10% for the file system. To meet it, a two-level MS-RAID5 of six 1 TB disks is configured under the Linux 3.1 kernel: group G0 consists of two disks and group G1 of three disks.


IOMeter is a powerful I/O test tool. Through its Dynamo load generator on the Linux side, IOMeter issues 2 KB–4096 KB requests at 40%, 60%, 80% and 100% sequentiality to MS-RAID and S-RAID. Figure 5 shows the performance under the different load requirements. For small data blocks, MS-RAID performs much like S-RAID; once the block size exceeds 128 KB, MS-RAID's write performance improves significantly. The reasons are: (1) when the block size is below 128 KB, MS-RAID enables the low-level group G0 while G1 stands by, so the low-level group's parallelism equals that of S-RAID5 and there is no obvious gap in write performance; (2) when the block size exceeds 128 KB, G0 can no longer meet the performance requirement, and MS-RAID switches to the high-level group G1 to adapt to the higher load. G1's in-group parallelism is higher and its stripe larger than S-RAID5's; with more data written per stripe, parity writes become less frequent and write performance improves markedly.

[Fig. 5 plots transfer size (MBps) versus request length (16 KB–4096 KB) for the 2G0+3G1 MS-RAID and the 2-disk-group S-RAID at 40%, 60%, 80% and 100% sequentiality.]

Fig. 5. Performance comparison of 40%, 60%, 80% and 100% sequential write

As Fig. 6 shows, for requests smaller than 128 KB the response-time gap between the four schemes is small: all are in the disk start-up phase, with disk-group pre-reading in progress. At a 256 KB request size, MS-RAID has already turned on disk group G1, giving it higher parallelism and lower response time than S-RAID.

[Fig. 6 plots average write response time (ms) versus request length (16 KB–4096 KB) for the 2G0+3G1 MS-RAID and the 2-disk-group S-RAID at 40%, 60%, 80% and 100% sequentiality.]

Fig. 6. Comparison of write response time

The performance experiment shows that MS-RAID's multilevel grouping can allocate storage space for write requests dynamically, adjusting the number of parallel disks and opening the appropriate disk group as the CDP load changes. Compared with S-RAID, under low performance demand MS-RAID opens the low-level disk group, reducing disk parallelism and energy consumption; under high demand it opens the high-level disk group, increasing parallelism and guaranteeing the system's required performance. By always opening the appropriate group, MS-RAID meets the CDP system's performance requirements and effectively balances the trade-off between energy consumption and performance.

3.2 Energy Consumption Test

In the energy consumption test, the total energy consumption of the disk array is calculated as formula (12):

    W = Σ_{i=0}^{n} V_i × I_i                                   (12)

where W denotes the total energy consumption of the disk array, and V_i and I_i the real-time voltage and current of disk i. To strengthen the comparison between different groupings, an S-RAID5 with 7 disks is set up in the same environment, with two groups of data disks (3 disks per group) and one parity disk.
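Sampled over time, formula (12) amounts to summing per-disk V·I products. A minimal sketch, with invented voltage/current figures rather than measured data:

```python
# Formula (12): total consumption as the sum of per-disk V * I samples.
def array_power(samples):
    """samples: list of (voltage, current) pairs, one per disk."""
    return sum(v * i for v, i in samples)

# two spinning disks plus one in standby (assumed figures)
print(array_power([(12.0, 0.45), (12.0, 0.45), (12.0, 0.08)]))  # ~11.76 W
```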


To avoid the system cache distorting the measurements, the energy consumption of MS-RAID and S-RAID is monitored continuously for 24 h after the system has run for one day. The results of the energy consumption test are shown in Fig. 7:

[Fig. 7 plots power consumption (W) over 24 h (19:00 to 19:00) for the 2-disk-group S-RAID5, the 3-disk-group S-RAID5 and the 2G0+3G1 MS-RAID5.]

Fig. 7. Energy consumption comparison of three schemes

At the beginning of the test, the energy consumption of MS-RAID5 and of the S-RAID5 with 2 disks per group differs little, because only 20 cameras run at first and the load is small: MS-RAID5 opens only the low-level group G0 while still guaranteeing the system's performance requirement. The S-RAID5 with 3 disks per group keeps one more disk running than MS-RAID5, which yields excess performance and greater energy consumption. In this stage, MS-RAID averages 9.3 W against 12.3 W for the 3-disk-per-group S-RAID, a saving of 24.4%. In the second stage of the experiment, 32 cameras work simultaneously; the load and the system's performance requirements rise, and MS-RAID5 opens the higher-performance group G1, consuming more energy while providing the required performance. The S-RAID5 with 2 disks per group keeps its energy consumption low but cannot meet the system's high performance demand. Together, the performance and energy consumption tests show that MS-RAID both reduces energy consumption when the CDP system is idle and guarantees system performance when it is busy, balancing the trade-off between the two.

4 Conclusion

This paper proposes a multilevel grouping strategy for CDP systems and designs and implements MS-RAID, a data layout based on it. The DIP algorithm optimizes MS-RAID's write performance, and disk energy-saving scheduling raises the storage system's energy efficiency. The array adopts a multiple-level division strategy in which different groups contain different numbers of disks, corresponding to the different performance requirements of the CDP system. Experiments show that the 6-disk two-level MS-RAID5 improves performance by 15.9%, 29.0%, 31.4% and 33.6% over the 2-disk-group S-RAID5 under 40%, 60%, 80% and 100% sequential workloads, while saving 34.6% energy compared with the 3-disk-group S-RAID5. This proves that MS-RAID is a more energy-efficient data layout.

Acknowledgements. The work was supported by the Natural Science Foundation of China (No. 61876019), the Natural Science Foundation of Hebei Province (Grant No. F2016202145), the Youth Foundation of Education Commission of Hebei Province (Grant No. QN2014192), and the Science and Technology Planning Project of Hebei Province of China (Grant No. 15210325).

References

1. Yu, X., Tan, Y., Zhang, C., Liang, C., Khaled, A., Zheng, J., Zhang, Q.: A high-performance hierarchical snapshot scheme for hybrid storage systems. Chin. J. Electron. 27(1), 76–85 (2018)
2. Yan, F., Tan, Y., Zhang, Q., Wu, F., Cheng, Z., Zheng, J.: An effective RAID data layout for object-based de-duplication backup system. Chin. J. Electron. 25(5), 832–840 (2016)
3. Dong, Y., Liu, J., Yang, J., et al.: HS-RAID2: optimizing small write performance in HS-RAID. J. Electr. Comput. Eng. 2016, Article no. 7341735, 8 pages (2016)
4. Patterson, D.: A case for redundant arrays of inexpensive disks. In: Proceedings of ACM SIGMOD Conference (1988)
5. Gurumurthi, S., Sivasubramaniam, A., Kandemir, M., et al.: DRPM: dynamic speed control for power management in server class disks. In: Proceedings of the 30th Annual International Symposium on Computer Architecture, San Diego, pp. 169–179. IEEE (2003)
6. Sun, Z., Zhang, Q., Li, Y., Tan, Y.A., et al.: DPPDL: a dynamic partial-parallel data layout for green video surveillance storage. IEEE Trans. Circuits Syst. Video Technol. PP(99), 1 (2016)
7. Xu, L., Cipar, J., Krevat, E., et al.: SpringFS: bridging agility and performance in elastic distributed storage. In: Proceedings of USENIX Conference on File and Storage Technologies, pp. 243–255. USENIX Association (2014)
8. Basmadjian, M., Hermann, M.D., Lent, R., Giovanni, G.: Cloud computing and its interest in saving energy: the use case of a private cloud. J. Cloud Comput. Adv. Syst. Appl. 1(5), 1–11 (2012)
9. Eric, S., Michael, L., Jon, S., et al.: Computational solutions to large-scale data management and analysis. Nat. Rev. Genet. 11, 647–657 (2010)
10. Li, X., Tan, Y., Sun, Z.: Semi-RAID: a reliable energy-aware RAID data layout for sequential data access. In: Proceedings of IEEE Symposium on Mass Storage Systems and Technologies, pp. 1–11. IEEE Computer Society (2011)
11. Liu, J., Zhang, J., Li, Y., et al.: Hybrid S-RAID: an energy-efficient data layout for sequential data storage. J. Comput. Res. Dev. 50(1), 37–48 (2013). (in Chinese)
12. Liu, J., Tan, Y., Xue, J., et al.: Writing optimization strategy in S-RAID based on sequential data characteristics. Chin. J. Comput. 37(3), 721–734 (2014). (in Chinese)

Incentivizing Multimedia Data Acquisition for Machine Learning System

Yiren Gu1, Hang Shen1,2(✉), Guangwei Bai1, Tianjing Wang1, Hai Tong1, and Yujia Hu1

1 College of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China
[email protected]
2 Department of Electrical and Computer Engineering, University of Waterloo, Waterloo N2L 3G1, Canada

Abstract. To address restrictions on data collection, we propose incentivizing multimedia data acquisition for a machine learning system. This paper presents an effective QoI (Quality-of-Information)-aware incentive mechanism for multimedia crowdsensing, with the objective of promoting the growth of an initial training model. First, an incentive model is constructed as a reverse auction that maximizes social welfare while meeting requirements on quality, timeliness, correlation and coverage. Then, we discuss how to achieve the optimal social welfare in the presence of an NP-hard winner determination problem. Finally, a practical incentive mechanism to solve the auction problem is designed, which is shown to be truthful, individually rational and computationally efficient. Extensive simulation results demonstrate that the proposed incentive mechanism produces close-to-optimal social welfare and obtains a high-QoI dataset; in particular, it achieves a significant improvement in machine learning model growth at lower complexity.

Keywords: Multimedia crowdsensing · Machine learning · QoI · Auction · Incentive mechanism

1 Introduction

There are two common ways to promote the growth of an initial machine learning model in a short time: optimizing the algorithm or improving dataset quality. The former (e.g., MobileNets [5, 10]) optimizes the model framework through better algorithms, while the latter supplies a large amount of data for continued training and learning. An immature machine learning model trained on a large amount of data can often beat a well-designed, sophisticated model trained on only a small amount, as in Automatic Speech Recognition [6] and Image Classification [7]. However, as machine learning technology matures, datasets face strict QoI requirements (quality, timeliness, correlation, coverage). Model training with large-scale datasets requires a lot of time, so a machine learning system hopes to receive the expected high-quality datasets before

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 142–158, 2018. https://doi.org/10.1007/978-3-030-05057-3_11


model training, datasets that also satisfy a given coverage requirement within the prescribed time and areas. Multimedia Crowdsensing (MC) [1, 2], the crowdsourcing of multimedia data, has huge potential to enable many new machine-learning-assisted multimedia applications expected to capture tremendous benefits in a variety of fields, including Google Views [8, 9]. Anyone with a Google account can log into Google Views to share the panoramic street views they photograph, which can serve the training of machine learning systems. What makes it special is that the Google Views homepage recommends some of the street views users contribute; clicking on an account shows the street-view album of that user's contributions. Compared with traditional data collection methods, MC makes large-scale participatory sensing viable quickly and with little infrastructure cost by leveraging personal mobile devices as sensor nodes, providing massive datasets for machine learning model training.

In general, the cost to a mobile user of participating in MC involves resource consumption and privacy leakage. Without sufficient incentives, mobile users may be reluctant to participate, so large amounts of high-QoI [4] data are never obtained for model training. It is therefore necessary to design an effective incentive mechanism to encourage participation. Recent research has focused on game-theoretic incentive mechanisms for MC systems [17–19] that use a user's bidding price as an important metric for rewards. However, most existing mechanisms fail to incorporate application-dependent QoI requirements: [12] considers the timeliness and effort of the data users collect, and [11] defines QoI as the effective gathering coverage and time of tasks, but no existing incentive-mechanism work considers all of these indicators together. Moreover, little work combines mechanism design with machine learning scenarios, even though the requirements on training data rise as the machine learning field matures.

This paper proposes an effective incentive mechanism for a multimedia-crowdsensing-enabled machine learning system, focused on obtaining massive high-QoI training datasets for machine learning models to enhance the growth of the training model. The main contributions include:

1. To guarantee the utilities of both the machine learning system and participating users, an incentive model based on a reverse auction is presented, which maximizes social welfare subject to the quality requirements of tasks, the timeliness of joining tasks, and the correlation and coverage of the collected pictures.
2. How to achieve the optimal social welfare is discussed in the presence of an NP-hard winner determination problem. A practical incentive mechanism to solve the auction problem is then designed, which is shown to be truthful, individually rational and computationally efficient.
3. Extensive simulation results show the proposed mechanism achieves noticeable superiority and produces close-to-optimal solutions; the datasets provided by our mechanism accelerate the growth of the machine learning model.

The rest of this paper is organized as follows. The motivation of this work is discussed in Sect. 2. Section 3 describes the system model of our mechanism. The optimal solution


and auction algorithm are given in Sect. 4. Finally, we present simulation results and performance analysis in Sect. 5 before concluding in Sect. 6.

2 Motivation

To obtain a mature image classification model, we constructed a 5-5 layer CNN framework and trained it with the real Belgian traffic datasets [3]. The training results show the classification accuracy approaching 100% as the number of iterations increases, with the loss value of the model falling to 0 (as illustrated in Fig. 1), indicating that the model becomes relatively mature in the later stage. However, no matter how the model framework is optimized, there is little room left to improve classification accuracy in the later period. The training classification accuracy for different batch sizes is shown in Fig. 2.

Fig. 1. Variation diagram of loss function

Fig. 2. Training accuracy curve of different samples

In view of this problem, the final classifications were inspected visually with the matplotlib tool. If the predicted and actual category labels are shown in green, the classification is correct; if red, it is wrong. Part of the classification results is shown in Fig. 3. The visual results reveal that the misclassified images are almost all low quality, indicating that dataset quality is also a key factor slowing the growth of the model.

Fig. 3. Comparison of predictions and actual categories under matplotlib visualization


3 System Model

A multimedia crowdsensing system, as shown in Fig. 4, is composed of a machine learning system, a multimedia sensing platform and many smartphone users. For its model-training needs, the machine learning system directly announces a set of picture-collection tasks f, expected to be accomplished within a time period [T1, T2]. To ensure the diversity of training data, the machine learning system needs users to collect M types of images; each picture type represents a subtask, so f = {s1, …, sM}.

[Fig. 4 depicts the interaction: the machine learning system issues tasks to the multimedia sensing platform and receives the returned results; the platform ① announces tasks to mobile users, who ② submit sensing plans; the platform ③ determines winners, users ④ upload sensing pictures, and the platform ⑤ makes payments.]

Fig. 4. Interaction model of platform and users in MC

The sensing platform processes requests from the machine learning system and helps it recruit mobile users. The user set is U = {u1, …, uN}, where N is the number of users. The interaction between the sensing platform and mobile users can be formulated as a reverse-auction mechanism design problem, described as follows:

1. The platform issues tasks to recruit quality-guaranteed users to participate.
2. Each user submits his sensing plan, a tuple bid_i consisting of the set of tasks he wants to execute, g_i ⊆ f, and his bid b_i for these tasks.
3. The platform uses an incentive mechanism to select the winners and calculate payments p = {p1, …, pN}.
4. Winners perform the sensing tasks and submit the results to the platform.
5. The platform checks the results and makes payments to the winners. Finally, all pictures are sent to the machine learning system for model training.

3.1 Auction Framework

As motivated in Sect. 2, low-quality datasets hinder the growth of an initial model, so we recruit a quality crowd to undertake the image-collection tasks. QoI is an important index integrated into our incentive mechanism. It is calculated by the sensing platform, depends on the sensing application, and can be defined from various factors: in [15], QoI is the quality of uploaded photos, where high-quality photos help the platform identify visible problems with medical devices; in [14], QoI is the users' estimation accuracy of air quality. The QoI of our paper is defined as follows.


Definition 1 (QoI of image datasets). The QoI of sensing, denoted q = {q1, …, qN}, is the clarity of the pictures. It depends on a user's joining time t_i and task achievements (such as the correlation a_i and coverage b_i of the shooting targets), which are complementary factors in executing tasks: the earlier a user's joining time t_i, the more time he has to prepare for collecting images, and with sufficient time his pictures are usually more relevant to the platform's requirements and his shooting coverage wider. If these constraints change, image quality changes too. We assume the platform maintains a historical record of users' QoI profile q, used as input for winner and payment determination. Each subtask s_j must be completed with a minimum quality Q_j; Q = {Q1, …, QM} denotes the profile of quality requirements over subtasks.

Definition 2 (A User's Utility). The payoff w_i of any u_i ∈ U is defined as the difference between payment p_i and cost c_i, which satisfies:

    w_i = { p_i − c_i,  x_i = 1
          { 0,          x_i = 0                                 (1)

where x_i indicates whether u_i wins:

    x_i = { 1,  if u_i is chosen as a winner
          { 0,  otherwise                                       (2)

If ui is a winner of our auction, he will be paid pi for executing the corresponding set of sensing tasks. In contrast, he will not be allocated any sensing task and receive zero payment. Deﬁnition 3 (Platform’s Proﬁts). The proﬁt of sensing platform is given as follows: / ¼ VðSÞ

X

pi

ð3Þ

i2S

where the value function $V(S)$ represents the sum of the values $v_i$ contributed by the winner set $S$:

$$V(S) = \sum_{i \in S} v_i = \sum_{i \in S} k\, q_i\, |g_i| \quad (4)$$

In Eq. (4), $k$ is a coefficient that transforms the image QoI into a monetary reward, and $|g_i|$ is the number of categories collected by $u_i$. The value function $V(S)$ is monotonic in $\vec{q}$: for any $\vec{q} = \{q_1, \ldots, q_N\}$ and $\vec{q}' = \{q'_1, \ldots, q'_N\}$, we have $V_{\vec{q}}(S) \le V_{\vec{q}'}(S)$ if $q_i \le q'_i$ holds for all $u_i \in U$. Since the smartphones are owned by different users, who are selfish but rational, mobile users will not participate in the MC without sufficient incentive. To guarantee the utilities of both the sensing platform and the participating users, the goal of this paper is similar to that of the traditional VCG mechanism [20, 21],

Incentivizing Multimedia Data Acquisition

147

aimed at designing an efficient incentive mechanism that maximizes the social welfare, formally described in Definition 4.

Definition 4 (Social Welfare). The social welfare of the whole MC is the sum of the users' payoffs and the sensing platform's profit:

$$\gamma = \phi + \sum_{u_i \in U} w_i = V(S) - \sum_{i \in S} c_i \quad (5)$$

3.2 Desirable Properties

Specifically, a user participating in sensing tasks incurs a cost $c_i$, and his maximum executable task set is $g'_i$; both are private and known only to the user himself. As a result, $c_i$ and $g'_i$ may differ from $b_i$ and $g_i$, respectively. This section describes three desirable properties for our auction mechanism.

• Truthfulness: An auction mechanism is truthful if and only if bidding the true value $(g'_i, c_i)$ is a dominant strategy for every bidder $u_i \in U$.
• Individual Rationality: An auction mechanism is individually rational if the payoff of any bidder $u_i$ is non-negative when he bids his true value $(g'_i, c_i)$.
• Computational Efficiency: An auction mechanism is computationally efficient if the outcome can be computed in polynomial time.

Truthfulness is the most difficult of the three properties to achieve. The bid is two-dimensional, containing two parts: the declared cost $b_i$ and the task set $g_i$ of bidder $u_i$. Therefore, Myerson's theorem [13] on the properties of one-parameter truthful mechanisms cannot be applied directly. To design a truthful auction mechanism with two dimensions, the following definitions are introduced:

Definition 5 (b-Monotonicity). If bidder $u_i$ wins by bidding $(g_i, b_i)$, then he also wins by bidding $(g_i, b'_i)$ for any $b'_i \le b_i$.

Definition 6 (g-Monotonicity). If bidder $u_i$ wins by bidding $(g_i, b_i)$, then he also wins by bidding $(g'_i, b_i)$ for any $g'_i \supseteq g_i$.

Definition 7 (Critical Payment). The payment $p_i$ for a winning bidder $u_i$ is set to the critical value $d_i$ such that bidder $u_i$ wins if $b_i \le d_i$ and loses if $b_i > d_i$.

Lemma 1. A mechanism is truthful if it satisfies b-monotonicity, g-monotonicity, and critical payment.

Proof: That a truthful bidder receives non-negative utility can be verified easily. If $u_i$ loses with an untruthful sensing plan $(g_i, b_i)$, or if $g_i \not\subseteq g'_i$, his utility is not positive. As a result, only the case in which $(g_i, b_i)$ wins and $g_i \subseteq g'_i$ needs to be considered.


First, from g-monotonicity, bidding with $(g'_i, b_i)$ also wins. Suppose that the payment for bid $(g_i, b_i)$ is $p$ and that for bid $(g'_i, b_i)$ is $p'$. Any bid $(g'_i, b''_i)$ with $b''_i > p'$ loses, because $p'$ is the critical payment for task set $g'_i$. Similarly, bidding with $(g_i, b''_i)$ also loses, by monotonicity. Therefore, the critical payment for $(g_i, b_i)$ is at most that for $(g'_i, b_i)$, i.e., $p \le p'$; in other words, the user does not increase his utility by bidding $(g_i, b_i)$ instead of $(g'_i, b_i)$. Next, consider the true bid $(g'_i, c_i)$, whose payment equals that of bidding $(g'_i, b_i)$ by the critical-payment property, i.e., $p_i$. If bidding $(g'_i, c_i)$ loses, then $c_i > p' \ge b_i$, so compared with $(g'_i, c_i)$, bidding with $(g'_i, b_i)$ does not increase his utility either. ☐
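To make the definitions of Sect. 3 concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation) of the quantities in Eqs. (1), (3), (4) and (5); the variable names and the toy instance are assumptions:

```python
# Illustrative sketch of the basic auction quantities of Section 3.

def value(S, k, q, g):
    """V(S) = sum over winners of k * q_i * |g_i|  (Eq. 4)."""
    return sum(k * q[i] * len(g[i]) for i in S)

def user_utility(i, S, p, c):
    """w_i = p_i - c_i if u_i wins, else 0  (Eq. 1)."""
    return p[i] - c[i] if i in S else 0.0

def platform_profit(S, k, q, g, p):
    """phi = V(S) minus the payments to the winners  (Eq. 3)."""
    return value(S, k, q, g) - sum(p[i] for i in S)

def social_welfare(S, k, q, g, c):
    """gamma = V(S) minus the winners' true costs  (Eq. 5)."""
    return value(S, k, q, g) - sum(c[i] for i in S)

# Toy instance: 3 users with quality q_i, task sets g_i and costs c_i.
q = {0: 0.9, 1: 0.5, 2: 0.7}
g = {0: {"s1", "s2"}, 1: {"s1"}, 2: {"s2", "s3"}}
c = {0: 1.0, 1: 0.4, 2: 0.8}
p = {0: 1.2, 1: 0.0, 2: 1.0}   # hypothetical payments
S = {0, 2}                     # hypothetical winner set
```

For instance, with $k = 1$ this toy instance gives $V(S) = 3.2$ and a social welfare of $1.4$.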

4 QoI-Aware Incentive Mechanism (QoI-RA)

In this section, a QoI-aware mechanism based on a reverse auction (QoI-RA) is presented. First, we discuss how to achieve approximately maximal social welfare. Then, we design two algorithms for this purpose. Finally, we present a practical QoI-RA that satisfies the three properties.

4.1 Optimal Solution of the QoI-RA Auction

The goal of QoI-RA is to maximize the social welfare given in Definition 4 while achieving computational efficiency, individual rationality, and truthfulness. Winner selection (QRA-WS) and pricing determination (QRA-PD) can be decoupled into two separate problems. Solving the maximization problem itself, referred to as the QRA-WS problem, is challenging because QRA-WS is NP-hard (proved in Theorem 1), let alone combining it with the other three properties.

QRA-WS Problem: Given the information of a task set $f$ and a user set $U$, the goal of the QRA-WS problem is to find a subset $S \subseteq U$. It can be formulated as the following integer linear program:

$$\max \sum_{i=1}^{N} w_i x_i \quad (6)$$

subject to:

$$\sum_{i:\, s_j \in g_i,\, u_i \in U} q_i x_i \ge Q_j, \quad \forall s_j \in f \quad (7)$$

$$x_i \in \{0, 1\}, \quad \forall u_i \in U \quad (8)$$

$$\alpha_i(t) \ge \hat{\alpha}, \quad T_1 \le t \le T_2 \quad (9)$$

$$\beta_i(t, r) \ge \hat{\beta}, \quad r \le l \quad (10)$$
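On very small instances, the ILP (6)–(8) can be checked by exhaustive search over user subsets. The following sketch is illustrative only (exponential time; it assumes the feasibility filters of constraints (9)–(10) were applied beforehand):

```python
from itertools import combinations

def qra_ws_bruteforce(w, q, g, Q):
    """Exhaustively solve the QRA-WS ILP (6)-(8) on a tiny instance.

    w[i]: marginal social welfare of user i (Eq. 12)
    q[i]: quality of user i; g[i]: set of subtasks user i bids for
    Q[s]: minimum total quality required for subtask s (constraint 7)
    Returns (best winner set, best objective value); best is None if no
    subset satisfies every quality constraint.
    """
    users = list(w)
    best, best_val = None, float("-inf")
    for r in range(len(users) + 1):
        for S in combinations(users, r):
            # Constraint (7): enough total quality on every subtask.
            feasible = all(sum(q[i] for i in S if s in g[i]) >= Qs
                           for s, Qs in Q.items())
            if feasible:
                val = sum(w[i] for i in S)   # objective (6)
                if val > best_val:
                    best, best_val = set(S), val
    return best, best_val

# Tiny example: three users, two subtasks.
best, val = qra_ws_bruteforce(
    w={0: 1.0, 1: -0.2, 2: 0.5},
    q={0: 1, 1: 1, 2: 1},
    g={0: {"a"}, 1: {"b"}, 2: {"a", "b"}},
    Q={"a": 1, "b": 1})
```

Here the optimum is $S = \{u_0, u_2\}$: user 2 alone covers both subtasks, but adding user 0 increases the objective.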


Using (4) and (5), we get:

$$\gamma = V(S) - \sum_{u_i \in S} c_i = \sum_{u_i \in U} \big(k\, q_i\, |g_i| - b_i\big)\, x_i \quad (11)$$

Let $\vec{w} = \{w_1, \ldots, w_N\}$ denote the marginal social welfare profile of all users based on the users' bids, where

$$w_i = k\, q_i\, |g_i| - b_i \quad (12)$$

Hence, maximizing the social welfare amounts to maximizing the total marginal social welfare of the users:

$$\gamma = \sum_{u_i \in U} w_i x_i \quad (13)$$

4.2 Constraints of Tasks

Definition 8 (Task's Quality). We use $Q_j(S)$, given by Eq. (14), to denote the total quality with which the winners accomplish task $s_j \in f$. Guaranteeing the quality of a subtask is therefore equivalent to guaranteeing that every task is executed by users with a sufficient amount of quality in total:

$$Q_j(S) = \sum_{i \in S:\, s_j \in g_i} q_i \quad (14)$$

The platform stipulates that the total quality of the images collected by users must satisfy the requirement of each subtask; constraint (7) expresses each subtask's quality requirement.

Definition 9 (Correlation). The correlation between the image type $x$ that users upload and the type $y$ that tasks require can be denoted by the function $\alpha(\cdot)$:

$$\alpha(t-d) = \big[\,2 - \operatorname{sgn}(t-d)\, f(d-t) + \operatorname{sgn}(d-t)\,\big]\; m_{xy} \quad (15)$$

where the relevance of the image type collected by users is evaluated from two aspects: objectivity and subjectivity. The former is determined by a function that satisfies the following properties: (1) it is monotonically non-increasing in the difference between $t$ and $d$, where $t$ is the joining time of the user and $d$ is the deadline of the sensing task; (2) it returns a value in $[0, 1]$: if a user joins a task earlier than the deadline, it always equals 1; otherwise, it monotonically decreases from 1 to 0. $f(t)$ is a sigmoid function. $m_{xy}$ is the platform's grade of the image correlation, which ranges over $[0, 1]$. The platform requires that users meet a certain relevance of the collected images, corresponding to constraint (9).
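As an illustration of the properties stated for Definition 9 (objectivity equal to 1 before the deadline, sigmoid decay afterwards, scaled by the platform grade $m_{xy}$), here is a hypothetical Python rendering; the exact functional form of Eq. (15) may differ from this sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def correlation(t, d, m_xy, steepness=1.0):
    """Illustrative alpha: objectivity * subjectivity (cf. Definition 9).

    Objectivity is 1 while the user joins at or before the deadline d,
    and decays monotonically from 1 toward 0 once t exceeds d; m_xy in
    [0, 1] is the platform's subjective grade of image-type relevance.
    """
    if t <= d:
        objectivity = 1.0
    else:
        # 2*sigmoid(d - t) equals 1 at t = d and decays toward 0 afterwards.
        objectivity = 2.0 * sigmoid(steepness * (d - t))
    return objectivity * m_xy
```

A user joining before the deadline gets the full grade; joining later yields a smoothly shrinking correlation score.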


Definition 10 (Coverage). The image coverage $\beta(\cdot)$ is represented by:

$$\beta(r,t) = \begin{cases} \dfrac{2}{1+e^{r}}\cdot\dfrac{d-t}{d-s}, & r > 0,\ s \le t \le d;\\[4pt] \dfrac{d-t}{d-s}, & r = 0,\ s \le t \le d;\\[4pt] 0, & \text{otherwise} \end{cases} \quad (16)$$

where $\beta(\cdot)$ is determined by two parts. The former is a monotonically decreasing function of $r$, where $r$ is the distance between the shooting location and the target object: when $r$ is large, the target coverage rate is low. In addition, users join at different times, so they take different numbers of photos. It is assumed that once a user has joined a task, he keeps executing it and spends the same time on each picture. As a result, the number of pictures is proportional to the user's participation time, which is captured by the latter part: the longer a user participates, the more pictures he can collect from different angles, so the coverage of the target becomes wider. For model-training needs, the sensing platform requires the image coverage to meet a specified standard, corresponding to constraint (10).

The optimal QoI-RA problem is then solved as follows: (1) Winner Selection: find a subset $S \subseteq U$ by solving the QRA-WS problem; (2) Payment Determination: determined on the basis of the QRA-WS problem; if $x_i$ equals $0$, then $p_i$ is $0$.

Theorem 1. The QRA-WS problem is NP-hard.

Proof: We prove the NP-hardness of the QRA-WS problem by a polynomial-time reduction from the minimum weighted set cover (MWSC) problem, which is NP-hard [23]. The MWSC problem is defined as follows: a universe set $E = \{s_1, \ldots, s_M\}$ consists of $M$ elements, whose subsets are collected in a set $f = \{g_1, \ldots, g_N\}$; every set $g_i \in f$ has a corresponding non-negative weight $w(g_i)$. The MWSC problem is to find a minimum-weight subset of $f$ whose union is $E$. Next, we construct an instance of the QRA-WS problem from an instance of the MWSC problem in polynomial time. First, we transform $g_i$ into $g'_i$ such that for every element in $g_i$ there exist $h_i \in \mathbb{Z}^+$ copies of the same element in $g'_i$. We require that every element $s_j \in E$ is covered at least $H_j \in \mathbb{Z}^+$ times. After this reduction, we obtain an instance of the QRA-WS problem.
In this instance, the users' quality profile is $\vec{q} = \{h_1, \ldots, h_N\}$; the users' bidding-task profile is $\vec{g} = \{g_1, \ldots, g_N\}$; the users' marginal social welfare profile is $\vec{w} = \{w_1, \ldots, w_N\}$; the users' service-duration profile is $\vec{t} = \{t_1, \ldots, t_N\}$; the users' service-location profile is $\vec{r} = \{r_1, \ldots, r_N\}$; and the tasks' quality-requirement profile is $\vec{Q} = \{H_1, \ldots, H_M\}$. Clearly, the QRA-WS problem represents a richer family of problems, in which the quality $q_i$ of any user $u_i$ and the quality requirement $Q_j$ of any task $j$ can take any value in $\mathbb{R}^+$, and the marginal social welfare can take any value in $\mathbb{R}$. So every instance of the MWSC problem is polynomial-time reducible to an instance of the QRA-WS problem, and the QRA-WS problem is NP-hard. ☐
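The construction in the proof can be sketched as follows; the dictionary layout and field names are illustrative assumptions, not the paper's notation:

```python
def mwsc_to_qra_ws(subsets, weights, copies, coverage):
    """Build a QRA-WS instance from an MWSC instance (cf. Theorem 1).

    subsets:  {i: set of elements that MWSC set g_i covers}
    weights:  {i: non-negative weight w(g_i)}
    copies:   {i: h_i, the number of copies of each element in g'_i}
    coverage: {s_j: H_j, required number of times s_j must be covered}
    Each set g_i becomes a user with quality h_i and marginal welfare
    -w(g_i), so maximizing total marginal welfare subject to the
    coverage constraints is minimizing the total cover weight.
    """
    users = {i: {"g": set(gi), "q": copies[i], "w": -weights[i]}
             for i, gi in subsets.items()}
    tasks = dict(coverage)   # per-subtask quality requirements Q_j = H_j
    return users, tasks

users, tasks = mwsc_to_qra_ws(
    subsets={0: {"s1", "s2"}, 1: {"s2"}},
    weights={0: 3.0, 1: 1.0},
    copies={0: 2, 1: 1},
    coverage={"s1": 2, "s2": 3})
```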

4.3 Mechanism Design

Because of the NP-hard nature of the QRA-WS problem, it is difficult to find a solution that maximizes social welfare in polynomial time. Meanwhile, the traditional VCG auction mechanism [20, 21] cannot be directly applied because it requires the social welfare to be exactly maximized. A natural step is to design a computationally efficient mechanism with close-to-optimal social welfare. On the basis of the proposed optimal solution, we use the users' bids $\{(g_1, b_1), \ldots, (g_N, b_N)\}$ to calculate the users' marginal social welfare $\vec{w}$, which acts as the input of the QRA-WS problem. First, the platform excludes the users who do not meet the timeliness, relevance, and coverage requirements of the tasks (lines 2–4). Next, the platform includes the remaining users whose marginal social welfare is non-negative in the winner set (lines 5–6). Then the platform obtains the user set $X$ of users whose marginal social welfare is negative by removing the current winners from the user set $N$ (line 7). It calculates the tasks' remaining quality-requirement profile $\vec{Q}'$ by subtracting from $\vec{Q}$ the quality provided by the currently selected winners (lines 8–9). The main loop is executed until every task's quality requirement is satisfied (lines 10–15). In the main loop, the minimum marginal social welfare effectiveness is used to select the third batch of users; it is defined as follows:

$$\xi = \frac{|w_i|}{\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}} \quad (17)$$

where $\xi$ is the ratio between the absolute value $|w_i|$ of the marginal social welfare of $u_i$ and his effective quality contribution $\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}$. The user with the minimum $\xi$ among the remaining users in $X$ is included in $S$. After that, the platform updates the set $S$ (lines 10–13) and the residual quality of the subtasks $\vec{Q}$ (lines 14–15). The platform then pays the winner set $S$ of Algorithm 1; if a user is not a winner, his payoff is zero. Algorithm 2 describes the pricing mechanism, which takes the winner set $S$ as input and outputs the payment profile $\vec{p}$. First, $\vec{p}$ is initialized as a zero vector (line 1). Then, like Algorithm 1, the platform excludes the users who do not meet the timeliness, relevance, and coverage requirements of the tasks and obtains the set $N^+$ (lines 2–4). Next, the platform includes all users with non-negative marginal social welfare in $X^+$ (lines 5–6). The main loop (lines 7–14) calculates the platform's payment to every winner. For every winner $u_i \in S$, the winner determination of Algorithm 1 is executed with all users except $u_i$ until the quality requirement of every task in $g_i$ is satisfied (lines 7–8). Then, the platform obtains the current winner set $S'$ (line 9) and calculates the payment differently in the following two cases (lines 10–14):

Case 1. The winner $u_i$ has $w_i \ge 0$ (lines 10–11). In this case, the user's critical payment is the bidding price $b'_i$ that satisfies $w'_i = k\, q_i\, |g_i| - b'_i = 0$, that is:

$$p_i = k\, q_i\, |g_i| \quad (18)$$


Case 2. For any winner $u_i$ belonging to case 2 (lines 13–14), we go through every $u_k \in S' \setminus U^+$ and calculate the maximum bidding price $b'_i$ at which user $u_i$ can replace $u_k$ as a winner, i.e., $b'_i$ satisfies Eq. (19):

$$\frac{|w_k|}{\sum_{j:\, s_j \in g_k} \min\{Q'_j, q_k\}} = \frac{k\, q_i\, |g_i| - b'_i}{\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}} \quad (19)$$


This can also be expressed as:

$$b'_i = k\, q_i\, |g_i| - |w_k|\, \frac{\sum_{j:\, s_j \in g_i} \min\{Q'_j, q_i\}}{\sum_{j:\, s_j \in g_k} \min\{Q'_j, q_k\}} \quad (20)$$

Finally, the maximum value among all the $b'_i$ discussed above is used as the payment for $u_i$.

4.4 Proof of Properties

In this section, we show that the QoI-RA auction is truthful, individually rational, and computationally efficient.

Theorem 2. The QoI-RA auction is truthful.

Proof: Consider any other bid $(g'_i, b'_i)$ of a user $u_i$ who wins by bidding $(g_i, b_i)$, where $b'_i \le b_i$ or $g_i \subseteq g'_i$. We analyze two cases. (1) $w_i \ge 0$: when $u_i$ makes the new bid $(g'_i, b'_i)$, $w'_i = k\, q_i\, |g'_i| - b'_i \ge k\, q_i\, |g_i| - b_i \ge 0$. (2) $w_i < 0$: the new marginal social welfare of $u_i$ is not affected by the previous bidding; as in case 1, the new bid gives $w'_i \ge w_i$. As a result, $u_i$ also wins with the new bid $(g'_i, b'_i)$ in Algorithm 1, so QRA-WS satisfies both bidding-task and price monotonicity. Furthermore, it is easily verifiable that the QRA-PD algorithm uses the supremum bidding price $b'_i$ such that bidding $(g_i, b'_i)$ still wins. Hence, from Lemma 1, we conclude that the QoI-RA auction is truthful. ☐

Theorem 3. The QoI-RA auction is individually rational.

Proof: We consider the two possible cases. First, the payoff of a mobile user $u_i \in U \setminus S$ is 0, since $u_i$ is not a winner, according to Algorithm 2. Second, $u_i$ is a winner. We have proved in Theorem 2 that users bid truthfully in our QoI-RA auction; as a result, each user bids his true cost $c_i$. Since QoI-RA preserves the critical-payment property as shown in Lemma 1, every winner is paid the supremum of the bidding prices. Then we have $p_i \ge c_i$ for every winner, i.e., $w_i = p_i - c_i \ge 0$. Therefore, the utility of every user $u_i$ is always non-negative. This completes the proof. ☐

Theorem 4. The computational complexity of the QoI-RA auction is $O(N^3 M)$.

Proof: The QoI-RA auction consists of the two algorithms QRA-WS and QRA-PD. The former first goes through all users to select those who meet the requirements of timeliness, correlation, and coverage, which needs $N$ iterations. Its computational complexity lies in the main loop, which terminates after $N$ iterations in the worst case; in every iteration, it also goes through every task $s_j \in f$, i.e., the inner loop runs $M$ times.
Hence, the computational complexity of Algorithm 1 is $O(N^2 M)$. Similarly, Algorithm 2 first needs $N$ iterations; it then chooses the users whose marginal social welfare is greater than 0, which iterates $N$ times in the worst case; finally, its third loop executes Algorithm 1 for each user $u_i \in S$. So the computational


complexity of Algorithm 2 is $O(N^3 M)$. Therefore, the overall computational complexity of the QoI-RA auction is $O(N^3 M)$. ☐
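The pseudo-code of Algorithms 1 and 2 is not reproduced in this excerpt, so the following Python sketch reconstructs the winner-selection procedure from the textual description in Sect. 4.3 (feasibility filtering, non-negative-welfare winners, then greedy selection by minimum effectiveness ratio of Eq. (17)); treat it as an illustration of the described steps, not as the authors' code:

```python
def qra_winner_selection(users, Q):
    """Greedy winner selection, following the description of Algorithm 1.

    users: {i: dict with 'w' (marginal welfare, Eq. 12), 'q' (quality),
            'g' (set of subtasks), 'feasible' (passed constraints (9)-(10))}
    Q: {s_j: minimum total quality Q_j}. Returns the winner set S.
    """
    cand = {i: u for i, u in users.items() if u["feasible"]}   # lines 2-4
    S = {i for i, u in cand.items() if u["w"] >= 0}            # lines 5-6
    X = set(cand) - S                                          # line 7
    Qr = dict(Q)                      # residual quality profile Q'
    for i in S:                                                # lines 8-9
        for s in users[i]["g"]:
            Qr[s] = max(0.0, Qr[s] - users[i]["q"])
    while any(v > 0 for v in Qr.values()) and X:               # lines 10-15
        def xi(i):   # effectiveness ratio of Eq. (17)
            contrib = sum(min(Qr[s], users[i]["q"])
                          for s in users[i]["g"] if Qr[s] > 0)
            return abs(users[i]["w"]) / contrib if contrib > 0 else float("inf")
        best = min(X, key=xi)
        if xi(best) == float("inf"):
            break                     # nobody can still contribute quality
        S.add(best)
        X.discard(best)
        for s in users[best]["g"]:
            Qr[s] = max(0.0, Qr[s] - users[best]["q"])
    return S
```

A pricing routine in the spirit of Eqs. (18) and (20) would then re-run this selection without each winner to find the critical replacement prices.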

5 Performance Evaluation

In this section, we present and discuss simulation results on a real dataset to justify the effectiveness of the proposed mechanism.

5.1 Simulation Settings

All the evaluation results are based on the real BelgianTS dataset [3]. The dataset consists of two parts: training data (4575 images) and testing data (2520 images). Each part contains 62 subdirectories, where each subdirectory from 0 to 61 represents a category/label. Each category has a different number of traffic-sign images, which are stored in the .ppm format. In order to compare easily with the previous experiments mentioned in Sect. 2, all the training images of BelgianTS are imported into our simulation environment, and each picture is labeled by us. The basic parameter settings are detailed in Table 1.

Table 1. Parameter setting

  k    = 0.1            |g_i|  ∈ [2, 4]
  q_i  ∈ [1, 2]         c_i    ∈ [10, 15]
  N    ∈ [100, 200]     M      = 62
  Q_j  ∈ [20, 30]       l      = 5
  T_1  = 0              T_2    = 30
  t_i  ∈ [0, 40]        α̂      = 0.6
  r_i  ∈ [0, 7]         β̂      = 0.6

For comparison, we choose two well-designed incentive mechanisms. The first baseline is a revised version of the greedy auction with the constraints defined in Sect. 4, which is truthful and individually rational. Its winner determination first selects the users with $w_i \ge 0$ who meet constraints (9) and (10) as winners. Differently from our mechanism, it then selects the user with the largest marginal social welfare among the remaining users until the QoI requirements of the tasks are met; its pricing mechanism pays each winner his supremum bidding price. The second baseline is a QoI-aware modified version of the traditional VCG auction [20, 21] (QoI-VCG), which consists of winner determination (VCG-WD) and pricing. The concept of QoI and our constraints are integrated into the VCG-WD problem, which can be solved optimally; the pricing mechanism of [20, 21] is used to pay the winners.

5.2 Simulation Results

Experiment 1 compares our mechanism with the two baseline mechanisms in terms of social welfare. The parameters are given in Table 1. To evaluate the impact of the number of users on the social welfare, we set the number of tasks to 62 and vary the


number of users from 100 to 200 with a step of 20. Figure 5 shows that the social welfare of all three mechanisms keeps going up as the number of users increases. The social welfare of the QoI-VCG auction equals the optimal solution of the QRA-WS problem. It can be concluded that the social welfare of the QoI-RA auction is close to optimal and far better than that of the baseline QoI-Greedy auction. In Experiment 2, we focus on one of the constraints: the image coverage threshold is varied among 0.5, 0.6, and 0.7, with the other parameters fixed. As shown in Fig. 6, the lower the coverage threshold, the greater the social welfare. This is because, when the restrictions of these indexes are set more loosely, more mobile users have a chance to participate, which helps the platform accomplish its tasks, and thus the social welfare is relatively high. However, a certain QoI limit should be kept for machine-learning model training, and it is unsuitable to set it too low. Experiment 3 looks at the running time, comparing the QoI-RA auction with the QoI-VCG auction; the parameters are the same as in Experiment 1. The simulation results are presented in Table 2: the QoI-RA auction executes in significantly less time than the QoI-VCG auction. That is because the QoI-VCG auction computes the actual social-welfare maximum; as the number of users increases, its execution time becomes so long that it is infeasible in practice. In contrast, the QoI-RA auction approaches the optimal social welfare while keeping the execution time low. In a word, the QoI-RA auction is much more computationally efficient than the QoI-VCG auction. In Experiment 4, two datasets of the same size but different qualities are fed into an initial CNN model for training: one group is a part of the original dataset, and the other group is selected by the QoI-RA mechanism. We then observe the effect of data quality on model training by comparing the two training runs. In Fig. 7, when the pictures collected by QoI-RA are input into the CNN, the model

Fig. 5. Impact of number of users

Fig. 6. Impact of coverage

Table 2. Comparison of execution time

  N     100     150     200     250     300     350     400     450     500
  VCG   6.325   8.361   10.574  12.016  65.293  32.786  95.475  60.251  2056
  QRA   0.119   0.128   0.132   0.157   0.224   0.195   0.219   0.226   0.230


Fig. 7. Comparison of training accuracy

Fig. 8. Comparison of loss value

accuracy increases faster. This is because the QoI-RA dataset is subject to the quality constraints: it selects high-quality images from all the datasets. In Fig. 8, the loss value of the model trained with the QoI-RA dataset falls to 0 more quickly. This indicates that the model trained with the high-quality dataset indeed has better learning ability: the growth speed of an initial model can be accelerated to some degree by improving the image quality. Finally, the prepared testing dataset of BelgianTS (2520 images) is used to test the accuracy of the trained model; the final classification accuracy reaches 95.7862%, and the time spent for classification is much shorter than before. This shows that our mechanism helps to obtain high-quality data, with which the growth speed of the model can be accelerated (Table 3).

Table 3. Testing results

  DataSet            Test accuracy    Testing time (s)
  Original dataset   90.4365%         1527.42
  QoI-RA dataset     95.7862%         1238.04

6 Conclusion

A QoI-aware incentive mechanism (QoI-RA) providing high-quality datasets for model training has been proposed in this work; it maximizes the social welfare subject to the quality requirement of each subtask, the timeliness of joining a task, and the correlation and coverage of the targets. Through extensive simulation results, we show that the proposed mechanism produces noticeably close-to-optimal social welfare. Dataset acquisition through our mechanism helps machine-learning models grow with lower complexity. We believe that our method can lay a foundation for the design of incentive mechanisms for multimedia crowdsensing with QoI constraints over machine learning systems.


Acknowledgements. The authors gratefully acknowledge the support and financial assistance provided by the National Natural Science Foundation of China under Grants No. 61502230, 61501224 and 61073197, the Natural Science Foundation of Jiangsu Province under Grant No. BK20150960, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 15KJB520015, and the Nanjing Municipal Science and Technology Plan Project under Grant No. 201608009.

References

1. Guo, B., Han, Q., Chen, H., et al.: The emergence of visual crowdsensing: challenges and opportunities. IEEE Commun. Surv. Tutor. PP(99), 1 (2017)
2. Li, Y., Jeong, Y.S., Shin, B.S., et al.: Crowdsensing multimedia data: security and privacy issues. IEEE Multimed. 24(4), 58–66 (2017)
3. https://btsd.ethz.ch/shareddata/
4. Restuccia, F., Ghosh, N., Bhattacharjee, S., et al.: Quality of information in mobile crowdsensing: survey and research challenges. ACM Trans. Sens. Netw. 13(4), 34 (2017)
5. Howard, A.G., Zhu, M., Chen, B., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
6. Hsu, W.N., Glass, J.: Extracting domain invariant features by unsupervised learning for robust automatic speech recognition. arXiv preprint arXiv:1803.02551 (2018)
7. Leroux, S., Molchanov, P., Simoens, P., et al.: IamNN: iterative and adaptive mobile neural network for efficient image classification. arXiv preprint arXiv:1804.10123 (2018)
8. Hara, K., Sun, J., Moore, R., et al.: Tohme: detecting curb ramps in Google Street View using crowdsourcing, computer vision, and machine learning. In: Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, pp. 189–204 (2014)
9. Anguelov, D., Dulong, C., Filip, D., et al.: Google Street View: capturing the world at street level. Computer 43(6), 32–38 (2010)
10. Sun, F., Huang, G.B., Wu, Q.M.J., et al.: Efficient and rapid machine learning algorithms for big data and dynamic varying systems. IEEE Trans. Syst. Man Cybern. Syst. 47(10), 2625–2626 (2017)
11. Man, H.C., Hou, F., Huang, J.: Delay-sensitive mobile crowdsensing: algorithm design and economics. IEEE Trans. Mob. Comput. PP(99), 1 (2018)
12. Xu, Y., Zhou, Y., Mao, Y., et al.: Can early joining participants contribute more? Timeliness sensitive incentivization for crowdsensing (2017)
13. Myerson, R.B.: Optimal auction design. Math. Oper. Res. 6(1), 58–73 (1981)
14. Cheng, Y., Li, X., Li, Z., et al.: AirCloud: a cloud-based air-quality monitoring system for everyone (2014)
15. http://www.fda.gov/MedicalDevices/Safety/ReportaProblem/ucm385880.htm
16. Krontiris, I., Albers, A.: Monetary incentives in participatory sensing using multi-attributive auctions. Parallel Algorithms Appl. 27(4), 317–336 (2012)
17. Duan, L., Kubo, T., Sugiyama, K., et al.: Incentive mechanisms for smartphone collaboration in data acquisition and distributed computing. In: Proceedings of IEEE INFOCOM, pp. 1701–1709 (2012)
18. Faltings, B., Li, J.J., Jurca, R.: Incentive mechanisms for community sensing. IEEE Trans. Comput. 63(1), 115–128 (2014)
19. Yang, D., Xue, G., Fang, X., et al.: Incentive mechanisms for crowdsensing: crowdsourcing with smartphones. IEEE/ACM Trans. Netw. 24(3), 1732–1744 (2016)
20. Clarke, E.H.: Multipart pricing of public goods. Public Choice 11(1), 17–33 (1971)


21. Groves, T.: Incentives in teams. Econometrica 41(4), 617–631 (1973)
22. Feng, Z., Zhu, Y., Zhang, Q., et al.: TRAC: truthful auction for location-aware collaborative sensing in mobile crowdsourcing. In: Proceedings of IEEE INFOCOM, pp. 1231–1239 (2014)
23. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. Resonance 1(9), 14–24 (2009)

Toward Performance Prediction for Multi-BSP Programs in ML

Victor Allombert¹, Frédéric Gava²(B), and Julien Tesson²

¹ Université d'Orléans, LIFO, Orléans, France
² Université Paris-Est Créteil, LACL, Créteil, France
[email protected]

Abstract. bsml and multi-ml are functional parallel programming languages "à la ml", based respectively on the bsp and multi-bsp bridging models. multi-bsp extends bsp to take into account hierarchical architectures. For both models, it is possible to predict the performance of algorithms thanks to embedded cost models. To do so, we propose formal operational semantics with cost annotations for the two aforementioned languages. This work has been done in an incremental manner: first we recall the cost semantics of the core-ml language; then, we adapt it to bsml, and then to multi-ml. It is then possible to evaluate the cost of a program following the annotated semantics. Finally, we compare the theoretical approach with the current implementation on a code example.

Keywords: Semantics · bsp · bsml · multi-bsp · Cost · Time prediction

1 Introduction

1.1 Context

The bulk synchronous parallelism (bsp) bridging model [16] was designed for flat parallel architectures. A bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. But modern high-performance computing (hpc) architectures are now hierarchical and have multiple layers of parallelism: communication between distant nodes cannot be as fast as communication among the cores of a given processor. We therefore consider the multi-bsp model [17], an extension of bsp. multi-ml [1,2] is a multi-bsp extension of bsml [8], a functional approach to programming bsp algorithms in ml; bsml is itself an extension of ocaml, a ml language (https://ocaml.org/). Complying with a bridging model eases the writing of code that is efficient and portable from one architecture to another, and also avoids deadlocks and non-determinism. The multi-bsp bridging model offers a high level of abstraction and takes into account real communication and synchronisation costs on hierarchical architectures. Thanks to the cost model embedded in

© Springer Nature Switzerland AG 2018. J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 159–174, 2018. https://doi.org/10.1007/978-3-030-05057-3_12


the (multi-)bsp model, it is possible to obtain the cost of a given algorithm. Using the (multi-)bsp parameters of an architecture then allows the execution time of a given code to be predicted. This is useful for resource-bound analysis and for finding performance bugs, and thus provides development-time feedback to hpc programmers. We chose ocaml (with our own distributed extensions) as the source language "à la ml" for several reasons. For one, ocaml is a widely used language for functional programming which is quite efficient in practice (sophisticated compiler and automatic memory management). Moreover, we wanted to demonstrate that it is possible to define a practical cost semantics for high-level hpc languages; imperative programming is closer to standard assembly code, which already has cost analyses such as wcet [15]. Even if functional programming is currently not the norm for hpc, it is more and more common for mainstream languages (such as java) to add functional features. Studying these features in ml, without having to manage other features (such as java's objects), is a classical way of transferring them to other languages. Cost prediction is important for the design of efficient algorithms, and also in domains where programs are executed under time constraints (such as physical engines in aeroplanes, etc.). In the future, even such domains will benefit from many-core architectures. Cost prediction of hpc programs is thus an important issue for ensuring the safety of such systems.

1.2 Example of the Methodology: The Sequential Case

An important first step in studying cost prediction of programs is to define the cost of the constructs of the language itself, that is, to define an operational big-step semantics that assigns a parametric cost to a well-formed expression. Having a compositional cost semantics is also important in order to get modular and incremental programming: from a software-engineering point of view, it makes sense that the cost of a subprogram does not depend (too much) on the context; for example, the cost of an array-sorting method should depend only on the size of the input and not on when it is called. The main hypothesis is that the resource consumption of a program is a linear combination of the number of executions of each construct in the program¹. The semantics models this idea by parameterizing the cost with unknown coefficients that correspond to each ml construct: the number of executions of each of these constructs constitutes the majority of the execution time of most ml codes [10]. Take the case of the core-ml language. It relies on a minimal set of ml constructions, sufficient to express all the behaviours that are used in ml programming; features such as records, modules, pattern matching and sum types are excluded. The grammar is:

  e ::= cst                 Constants
     |  op                  Operators
     |  x                   Variables
     |  (e e)               Application
     |  let x = e in e      Binding
     |  fun x → e           Function
     |  rec f x → e         Recursive function
     |  if e then e else e  Conditional

¹ But their combination may be non-linear, as for algorithms with polynomial or exponential complexities.

Toward Performance Prediction for Multi-BSP Programs in ML


In this grammar, x and f range over an infinite set of identifiers. We find the typical ml-like constructs: let for bindings, and fun and rec for functions and recursive functions, respectively. As expected, application is written (e e). For the sake of readability, we use the familiar infix notation for binary operators, with the usual precedence and associativity rules; when the context is clear, we omit parentheses. op stands for standard operators, such as common computations on integers. cst stands for constants such as integers, booleans, etc. An expression is evaluated into a value v, defined as:

v ::= op | cst | (fun x → e)[E] | (rec f x → e)[E]
E ::= {x1 → v1, …, xn → vn}

Values comprise constants and closures (a value which stores both a function and its environment). An environment E is interpreted as a partial mapping with finite domain from identifiers to values. The extension of E by v at x is written E{x → v}. An inference rule is written as follows:

  P
─────────────
E ⊢ e ⇓ v, C

that is, under the premise P, the expression e evaluates in environment E to the value v at cost C. The cost (time and memory) consumed by each construct is averaged out to a constant. Hence, the execution time of a program with cost C is $\sum_{c \in \mathcal{C}} n_c \times T_c$, where $\mathcal{C}$ is the set of constructs, $n_c$ the number of executions of construct c during the whole program run, and $T_c$ the execution time of that construct. Estimating the overall execution time of a program (in seconds) from the semantics then consists in estimating each $T_c$ (in µs) by micro-benchmarking² and substituting them into the extracted cost C. The inference rules for core-ml are defined in Fig. 1 and work as follows. The Csts and Ops rules do not generate any additional cost: we assume they are static values which are accessible for free. Vars accesses a value bound in memory using the lookup operator (which returns the corresponding bound value).
Since this operator accesses a value stored in memory, its cost should be proportional to the path through the different cache memories; however, we chose to set it to a constant Tvar in order to simplify the rules. The Closure rule mainly models the way values are enclosed inside a function closure. It uses the select operator which, given an environment E and a function (code), returns the minimal environment needed to evaluate that code. We assume that the cost of building such an environment is proportional to the number of free variables (F, defined by trivial induction on expressions) of e. This is an approximation which could be refined by taking more ocaml mechanisms into account. Recursive functions are built in the same way. The App, Let and If rules are straightforward: we simply propagate the cost produced by each subexpression. Note the modification of the environment in the application rule, needed to evaluate the code of the closure. Each operator also incurs a cost, noted c3 in the rule, and we write op v for the newly built value. The "s" annotation on the rules

² This assumption does not truly hold on most relevant platforms (e.g. because of the garbage collector and cache misses) but is sufficient for our study; we leave more subtle analyses to future work and focus here on parallelism.
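The linear cost hypothesis above can be sketched numerically: given per-construct timings $T_c$ (obtained by micro-benchmarking) and execution counts $n_c$, the predicted time is their weighted sum. The timings and counts below are illustrative placeholders, not measured values from the paper:

```python
# Sketch of the linear cost model: predicted time = sum over constructs c
# of n_c * T_c.  Timings (microseconds) and counts are made up for the example.
T = {"var": 0.6, "app": 1.5, "let": 1.3, "op": 0.9}   # T_c, micro-benchmarked
n = {"var": 10, "app": 3, "let": 2, "op": 5}          # n_c, counted per run

predicted_us = sum(n[c] * T[c] for c in n)            # 17.6 us here
```

A real analysis would obtain the $n_c$ symbolically from the cost semantics rather than by counting a concrete run.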


V. Allombert et al.

Fig. 1. The cost semantics of the sequential core-ml language.

is unused here but will be necessary for the bsp supersteps. It is also straightforward to show that ⊕ is commutative.

1.3 Outline

In this article we introduce the formal cost semantics first of the bsml language (Sect. 2) and then extend it to multi-ml (Sect. 3). For both languages, we first present the execution model, then the cost model, and finally give the cost-annotated semantics for core languages capturing the syntax of the aforementioned languages. Finally, we compare the predicted execution times with the actual ones on a small example (Sect. 4).

2 BSP Programming in ML and Cost Semantics

2.1 The BSP Bridging Model

In the bsp model [16], a computer is a set of p uniform processor-memory pairs connected by a communication network. A bsp program is executed as a sequence of supersteps (Fig. 2), each divided into three successive, disjoint phases: (1) each processor uses only its local data to perform sequential computations and to request data transfers to other nodes; (2) the network delivers the requested

Fig. 2. A bsp superstep.



data; (3) a global synchronisation barrier occurs, making the transferred data available for the next superstep. A bsp architecture can easily be mapped onto any general-purpose parallel architecture, and thanks to the bsp cost model it is possible to accurately estimate the execution time of a bsp program from the bsp parameters. The performance of a bsp computer is characterised by four parameters: the local processing speed r; the number of processors p; the time L required for a barrier; and the time g for collectively delivering a 1-relation. g and L can be expressed in floating-point operations (flops) and r in flops per second. These four parameters can easily be benchmarked [3]. A 1-relation is a collective exchange where every processor receives/sends at most one word; the network can deliver an h-relation in time g × h. The execution time (cost) of a superstep s is the sum of the maximal local processing time, the data delivery time and the global synchronisation time. It is expressed by the following formula:

$\mathrm{Cost}(s) = \max_{0 \le i < p} w_i^s + h_s \times g + L$

where $w_i^s$ is the local computation time of processor i during superstep s and $h_s$ the maximal number of words exchanged by a processor during s.
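The superstep cost formula can be evaluated numerically: each superstep costs the maximum local work plus the h-relation and the barrier, and a program costs the sum over its supersteps. The numbers below are illustrative flop counts, not benchmarks of a real machine:

```python
# Sketch of the BSP cost model: Cost(s) = max_i(w_i) + h * g + L,
# summed over the supersteps of a program (all values are made up).
def superstep_cost(w, h, g, L):
    return max(w) + h * g + L

g, L = 3, 100                            # illustrative BSP parameters
supersteps = [
    ([50, 60, 40, 55], 10),              # (local work per processor, h-relation)
    ([20, 20, 25, 20], 4),
]
total = sum(superstep_cost(w, h, g, L) for (w, h) in supersteps)
```

Note how a single slow processor (60 here) dominates its superstep: this is the max in the formula, and the reason load balancing matters in bsp.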

2.2 The BSML Language

bsml [7] uses a small set of primitives and is currently implemented as a library (http://traclifo.univ-orleans.fr/bsml/) for the ml programming language ocaml. An important feature of bsml is its confluent semantics: whatever the order of execution of the processors, the final value will be the same. Confluence is convenient for debugging since it allows the use of an interactive loop (toplevel). It also simplifies programming, since the parallelisation can be done incrementally starting from an ocaml program. A bsml program is built like an ml one but uses a specific data structure called a parallel vector, of ml type 'a par. A vector expresses that each of the p processors embeds a value of some type 'a. Figure 3 summarises the bsml primitives. Informally, they work as follows: ≪e≫ is the vector holding e everywhere (on each processor); the ≪ ≫ indicates that we enter the scope of a vector. Within a vector, the syntax $x$ can be used to read the vector x and get the local value it contains. The processor ids can be accessed with the predefined vector pid. When a value is referenced within the scope of a parallel vector, its locality is l (local); otherwise, the locality is b (bsp).

Fig. 3. Summary of the bsml primitives.



The proj primitive is the only way to extract local values from a vector. Given a vector, it returns a function such that, applied to the pid of a processor, it returns the value of the vector at this processor. proj performs communication to make local results available globally, and ends the current superstep. The put primitive is another communication primitive: it allows any local value to be transferred to any other processor. It is also synchronous, and ends the current superstep. The parameter of put is a vector that, at each processor, holds a function returning the data to be sent to processor j when applied to j. The result of put is another vector of functions: at a processor j, the function, when applied to i, yields the value received by processor j from processor i.
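The exchange pattern of these two primitives can be mimicked sequentially. This is a toy simulation, modelling a parallel vector as a plain Python list of length p (our modelling for illustration, not the BSML API):

```python
# Toy sequential simulation of the two synchronous BSML primitives.
p = 4

def proj(vec):
    # proj: makes every local value globally available, as a function pid -> value
    return lambda i: vec[i]

def put(senders):
    # senders[i](j) is the message processor i sends to processor j; the result,
    # indexed at processor j and applied to i, is the value j received from i.
    return [(lambda j: (lambda i: senders[i](j)))(j) for j in range(p)]

pid = list(range(p))
f = proj(pid)                                           # f i  ->  i
recv = put([(lambda i: (lambda j: 10 * i + j))(i) for i in pid])
value_3_got_from_1 = recv[3](1)                         # senders[1](3) = 13
```

In real bsml each of these calls would also end the current superstep (barrier); the simulation only captures the data movement.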

2.3 Cost Semantics

Extension. To obtain core-bsml, we extend the expressions of core-ml with parallel primitives as follows: e ::= · · · | replicate (fun _ → e) | (proj e) | (put e) | (apply e e). The distinction between the syntactic sugar (the ≪ ≫ and $ notations), used when programming bsml algorithms, and the core parallel primitives (replicate and apply), available only in the semantics, simplifies the semantics. Indeed, the syntactic sugar eases programming but is not suitable for the semantics, as it introduces implicit assumptions. Thus, we must transform the syntactic sugar into the core parallel primitives. The transformation is straightforward and produces an equivalent expression. The parallel vector scope, denoted ≪e≫, is transformed using the replicate core primitive: ≪e≫ is simply transformed into replicate (fun _ → e). The $ syntax is transformed using the apply primitive. The transformation is simple and does not require complicated expression analysis: we build a vector of functions that takes, as argument, the dollar-annotated value, and using the apply primitive we apply this vector of functions to the vector of values. For example, the expression ≪(e $x$)≫ is transformed into apply (replicate (fun x → e x)) x. Values are also extended with parallel vectors: v ::= · · · | ⟨v1, …, vp⟩. In the following, to simplify the notations, we index processors from 1 to p (and not from 0 to p − 1 as is common in hpc). We also assume that there exists a special vector pid = ⟨1, …, p⟩ (the ids of the processors). The main modification concerns the costs. During a superstep, the asynchronous costs are counted independently; it is only at the barrier that the maximum of the costs (computation and communication) is taken into account.
But a single superstep can span two different parts of an expression (for example, in let x = ≪1+1≫ in ((proj ≪$x$+1≫) 2), the first superstep begins in the first part of the let and continues just before the call of proj, and the second superstep starts when the result of proj is applied to the constant 2). For this reason, we extend the costs with vectors of costs ⟨c1, …, cp⟩s, where each component ci describes the current local cost of processor i during the superstep s. This s is modified only by the rules of the synchronous primitives. Furthermore, we add the three following equivalences:



1. ⟨c1, …, cp⟩s ⊕ ⟨c′1, …, c′p⟩s ≡ ⟨c1 ⊕ c′1, …, cp ⊕ c′p⟩s, if the ci and c′i do not contain vectors
2. ⟨Top ⊕ c1, …, Top ⊕ cp⟩s ≡ Top ⊕ ⟨c1, …, cp⟩s, for any Top
3. 0 ≡ ⟨0, …, 0⟩s, for any s

These rules aim to keep the previous rules for the sequential constructs of the language (let, fun, etc.) usable unchanged.

Lemma 1. The costs with parallel vectors of costs form a commutative and associative structure in which 0 is the neutral element inside or outside cost vectors, and ⟨0, …, 0⟩s is the neutral element outside vectors only.

Adding Rules. We must now extend our inference rules in order to take the bsp primitives into account. These rules are given in Fig. 4. They work as follows.

Fig. 4. The cost semantics of the core-bsml language.

The Rpl rule builds, asynchronously, a new parallel vector. The expression e is evaluated on each component, in parallel, producing a new vector of costs for the current superstep s. The valid function forbids nested vectors and is fully defined in [1]; a type system has been designed so that this check need not be done dynamically. Then a construct cost is added linearly. The Apply rule works similarly but for two expressions, which thus add two different costs (not necessarily vectors, and possibly for different supersteps); we finally build the vector by computing its components in parallel (on each processor), linearly adding a new cost vector. The Proj rule adds a barrier (L) and thus finishes the superstep (updating s). From the exchanged values, an h-relation cost is added: g and L are thus special constructs. The Put rule is quite dense because of the number of communications between all the processors during the evaluation of the primitive, but it is close to the Proj rule; for the sake of conciseness, we do not show it. The data sizes are computed by simple induction on the values (h-relation): this is rather naive but sufficient for an upper bound. Given E ⊢s e ⇓ v, s′, c, the overall execution time is max(c) ⊕ L, where the function max first applies the three previous equivalences in order to aggregate (merge) the cost vectors of the same superstep until no more merging is



possible. Finally, when the cost (time and memory) consumed by each construct is statically known (in µs), max(⟨c1, …, cp⟩s) = ci if ∀j ≠ i, cj ≤ ci.

Lemma 2. max is idempotent, that is, ∀c, max(max(c)) = max(c).

For example, for let x = ≪1+1≫ in ((proj ≪$x$+1≫) 2), beginning with any environment E at any superstep s, on a two-processor bsp machine, the cost semantics indicates that this expression adds the cost: ⟨T+, T+⟩s ⊕ Trpl ⊕ Tapp ⊕ ⟨Tvar ⊕ T+, Tvar ⊕ T+⟩s ⊕ 1 × g ⊕ L ⊕ Tapp (two vector constructions, both with an addition; a synchronous primitive; and a final application). That is to say, in any context, the expression adds T+ during the asynchronous phase of the current superstep s, finishes it and begins a new superstep. On its own, the cost of this expression can be simplified to 2 × T+ ⊕ g ⊕ L.
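The aggregation performed by max can be sketched numerically for the example above: cost vectors of the same superstep are merged componentwise (equivalence 1), then the superstep contributes its maximum component plus h × g and L. All timings below are symbolic placeholders, not measurements:

```python
# Sketch of the cost aggregation for the worked example, on p = 2 processors.
# Timings (arbitrary units) are illustrative placeholders.
T_plus, T_var, T_rpl, T_app = 1.0, 0.5, 0.2, 0.3
g, L = 3.0, 10.0

def merge(v1, v2):
    # equivalence 1: componentwise sum of two cost vectors of the same superstep
    return [a + b for a, b in zip(v1, v2)]

# <T+, T+>_s  merged with  <Tvar + T+, Tvar + T+>_s
superstep1 = merge([T_plus, T_plus], [T_var + T_plus, T_var + T_plus])
cost = max(superstep1) + 1 * g + L       # end of superstep: 1-relation + barrier
cost += T_rpl + T_app + T_app            # remaining scalar construct costs
```

With these placeholder values the total is 16.3 units; the asymmetric case (different ci per processor) is where the max actually matters.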

3 Multi-BSP Programming in ML and Cost Semantics

3.1 The Multi-BSP Bridging Model

multi-bsp is a bridging model [17] adapted to hierarchical architectures, mainly clusters of multi-cores. It is an extension of the bsp bridging model. The structure and abstraction brought by multi-bsp allow portable programs with scalable performance predictions, without dealing with the low-level details of the architectures. The model brings a tree-based view of nested components (sub-machines) of hierarchical architectures, where the lowest stages (leaves) are processors and every other stage (nodes) contains memory. Every component can execute code, but they have to synchronise in order to exchange data. Thus, multi-bsp does not allow subgroup synchronisation of arbitrary groups of processors: at a stage i there is only a synchronisation of the sub-components, i.e. of each of the computational units of stage i − 1. So, a node executes some code on its nested components ("children"), then waits for the results, does the communication and synchronises the sub-machine. A multi-bsp algorithm is thus composed of several supersteps, each synchronised for each sub-machine. An instance of multi-bsp is defined by d, the fixed depth of the (balanced and homogeneous) tree architecture, and by 4 parameters for each stage i of the tree, (pi, gi, Li, mi): pi is the number of sub-components at stage i − 1; gi is the bandwidth between stages i and i − 1, i.e. the ratio of the number of operations to the number of words that can be transmitted per second; Li is the synchronisation cost of all sub-components of a component of stage i − 1; mi is the amount of memory available at stage i for each component of this stage. With these parameters, the cost of a multi-bsp algorithm can be computed as the sum of the costs of the supersteps of the root node, where the cost of each such superstep is the maximal cost of the supersteps of the sub-components (plus communication and synchronisation); and so on, recursively.
Let $C_j^i$ be the communication cost of superstep j at stage i: $C_j^i = h_j \times g_i + L_i$, where $h_j$ is the maximum size of the messages exchanged at superstep j, $g_i$ the



communication bandwidth at stage i and $L_i$ the synchronisation cost. We can express the cost T of a multi-bsp algorithm as follows:

$T = \sum_{i=0}^{d-1} \sum_{j=0}^{N_i - 1} \left( w_j^i + C_j^i \right)$

where d is the depth of the architecture, $N_i$ is the number of supersteps at stage i, and $w_j^i$ is the maximum computational cost of superstep j within stage i. Note that the bsp and multi-bsp cost models are both linear combinations of costs of asynchronous computations and costs of communications (separated by barriers).
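The double sum above can be evaluated directly once the per-stage parameters and per-superstep workloads are known. The depth-3 machine below is a toy instance with made-up parameters, not a benchmarked one:

```python
# Sketch of the multi-BSP cost formula:
#   T = sum over stages i of sum over supersteps j of (w_j^i + h_j * g_i + L_i)
def multi_bsp_cost(per_stage):
    # per_stage[i] = (g_i, L_i, [(w, h), ...])  with one (w, h) pair per superstep
    return sum(w + h * g + L
               for (g, L, steps) in per_stage
               for (w, h) in steps)

per_stage = [
    (1, 100, [(500, 20)]),            # stage 0 (root): 1 superstep
    (3, 50,  [(200, 10), (150, 5)]),  # stage 1: 2 supersteps
    (6, 10,  [(80, 2)]),              # stage 2 (leaves): 1 superstep
]
total = multi_bsp_cost(per_stage)
```

Here each w is already the maximum over the sub-components of its stage, as required by the model; collecting those maxima is what the semantics' max function does.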

3.2 The Multi-ML Language

multi-ml [1,2] (https://git.lacl.fr/vallombert/Multi-ML) is based on the idea of executing bsml-like code on every stage of a multi-bsp architecture. This approach facilitates incremental development from bsml code to multi-ml code. multi-ml follows the multi-bsp approach, where the hierarchical architecture is composed of nodes and leaves. On nodes, it is possible to build parallel vectors, as in bsml. This parallel data structure manages values that are stored on the sub-nodes: at stage i, the code let v = ≪e≫ evaluates the expression e on each stage i − 1 component. Inside a vector, we write #x# to copy the value x stored at stage i to the memory of stage i − 1. The (mkpar f) primitive is an alternative way to build a vector using a function f. Typed (int → α) → α par, it applies the given function to each processor identifier (from 0 to pi − 1) of a node, locally, and then distributes the results down to the sub-nodes. The main difference with the ≪e≫ notation is that (mkpar f) reduces costs when the communication cost of e is high while the execution cost of f and its result is low. As in bsml, we also find the proj and put primitives and the $x$ syntax, all with the same semantics. We also introduce the concept of a multi-function to recursively traverse a multi-bsp architecture. A multi-function is a particular recursive function, defined by the keyword let multi, which is composed of two codes: the node code and the leaf code. The recursion is initiated by calling the multi-function (recursively) inside the scope of a parallel vector, that is to say, on the sub-nodes. The evaluation of a multi-function starts (and ends) on the root node. The following code shows how a multi-function is defined: after the definition of the multi-function mf on line 1, where [args] stands for a set of arguments, we define the node code (lines 2 to 6). The recursive call of the multi-function is done on line 5, within the scope of a parallel vector.
The node code ends with a value v, which is available as the result of the recursive call from the upper node. The leaf code, lines 7 to 9, consists of sequential computations.
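The control flow of a multi-function, separate node and leaf codes with recursion on the children and results flowing back up, can be mimicked by a plain recursive traversal of the machine tree. This sketch uses Python dictionaries for the tree (our modelling, not Multi-ML syntax):

```python
# Toy model of a multi-function: the node code recurses on the children and
# combines their results; the leaf code does purely sequential work.
def multi_mf(component, x):
    if component["children"]:                              # node code
        results = [multi_mf(c, x + 1) for c in component["children"]]
        return sum(results)                                # value v sent upward
    else:                                                  # leaf code
        return x * x                                       # sequential work

# a root node with two leaves, mirroring a depth-2 multi-BSP machine
machine = {"children": [{"children": []}, {"children": []}]}
v = multi_mf(machine, 1)                                   # starts at the root
```

In real multi-ml the recursive calls on the children happen inside a parallel vector (hence in parallel, with a synchronisation before the node combines), which the sequential sketch cannot show.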



We also propose another parallel data structure called a tree. A tree is a distributed structure where a value is stored in the memory of every node and leaf. A tree can be built using a multi-tree-function, with the let multi tree keyword, and can be handled by several primitives of the language; we do not detail this construction here. Similarly to bsml and its b and l localities, in multi-ml we introduce the locality m when a value refers to the multi-bsp level, and s on leaves (sequential).

3.3 Cost Semantics

Extension. To obtain core-multi-ml, we extend core-bsml with multi-functions as follows: e ::= · · · | (down x) | multi f x → e † e. The multi-function definition is written with the keyword multi. It takes one argument and two expressions separated by the † symbol; the first expression stands for the node code and the second for the leaf code. The down primitive transfers a value to all the sub-nodes. The transformation from the # syntax into the down primitive is obvious and works like the other syntactic sugar of bsml. For example, the expression ≪e #x#≫ is transformed into apply (replicate (fun x → e x)) (down x). As the #-annotated value is given as an argument of the vector of functions, there are no redundant copies: the expression is transformed into code that copies x to the sub-nodes only once. Parallel vectors of values (and costs) now also depend on their depth n in the multi-bsp architecture. Closures of multi-functions are also added. Thus we have v ::= · · · | ⟨v1, …, vpn⟩ | (multi f x → e † e)[E].

Adding Rules. We must now extend our inference rules in order to take into account the multi-functions and the nested bsml codes. These rules are given in Fig. 5. They work as follows.

Fig. 5. The cost semantics of the core-multi-ml language.

These new rules require some updates to the previous ones. First, ⇓ is parameterised by the different execution levels of multi-ml and by the stage n (starting from 1). The bsml rules are trivially updated with this stage in order to build vectors of the right size.



As a node is a particular component where it is possible to express bsp parallelism, we must consider the synchronous costs generated by bsp computations. These rules, at a stage n, are used to recurse through the multi-bsp architecture using the multi-function. Therefore, the max function now first merges the vectors of the same (sub-)superstep, and we finally use the following equivalence (for each superstep s): max(n1 × T1 ⊕ · · · ⊕ nt × Tt ⊕ ⟨c1, …, cpn⟩s) ≡ max(n1 × T1 ⊕ · · · ⊕ nt × Tt, maxi=1..pn(ci)); that is, we take the maximum between the computation of the parent node and the maximum over its own children. The MultiCall rule is for calling the multi-function at level m. The superstep counter is initialised to 0, and the stage to 1. The code of the node begins (level b). This rule terminates with a whole, synchronous broadcast of the final value v, where g = g1 + g2 + ... + gd (and likewise for L). This is due to the execution model of multi-ml, in which the code outside multi-functions is run by all the processors in order to manage the whole execution, so the value must be known by all the processors. The maximum function allows us to get the right cost over all children. The rule applies only if v is valid (as in bsml); our type system forbids expressions that do not have this property [1], so we can assume that all evaluated expressions are correct. The MultiLeaf rule goes to the leaf level. The number of supersteps stays the same when going through the leaf level (only sequential code is allowed). The MultiNode rule goes through the hierarchical architecture (inside a vector) from one node to another (the child); thus the stage is incremented. A final synchronisation is used to wait for all the children before terminating the node code (the recursive call of the multi-function). This allows taking the maximum of the computations of the sub-supersteps, as required by the multi-bsp cost model.
In multi-ml, the building of a vector is an asynchronous operation, with the emission of a creation signal from the node processor to the sub-nodes (or leaves). It is thus no longer possible to use the second equivalence of ⊕, which now is only commutative between two Ln (barriers) at a stage n. Note that the lookup function must also look the variable up in the right memory: a variable defined at stage n is not available at other stages. To handle this, one must add indices to the environment E; more details are available in [1]. Here, only the MultiNode and MultiLeaf rules can be evaluated. The cost of the multi-function recursive call taking place on both the node and the leaf is simple: we just add the evaluation costs of e1 and e2, plus the multi-function call cost, resulting in the recursive call. The MultiNode rule adds the Ci costs resulting from the potential asynchronous computations done on the node; thus we collect all the costs engendered by the multi-function recursion. As expected, this mechanism is not necessary in the MultiLeaf rule, as there is no parallel computation at this level.

4 Experiments

Thanks to the cost model embedded in the multi-bsp model, it is possible to estimate the evaluation cost of a multi-ml program. According to the multi-bsp



parameters standing for a machine specification, it is then possible to predict the execution time of a program. To verify that the cost estimation obtained from the multi-bsp cost formulae is valid, we compare the measured computation time of a simple algorithm to its predicted cost. To do so, we analyse a matrix-vector product algorithm based on the map/reduce skeleton. Using the multi-bsp parameters of the targeted architecture, we are able to predict the computation time for various inputs. Our example has been written in a functional style using tail-recursive functions; thanks to the ocaml compiler, these functions are compiled into an efficient imperative version.

4.1 Algorithm Description

We consider a simple algorithm computing the product of a matrix and a vector. Given a matrix M of dimension n × m, where n stands for the number of lines and m for the number of columns, and a vector V of dimension m, the computation is M × V = x, where x = (x0, …, xn−1) with $x_i = \sum_{j=0}^{m-1} M_{ij} \times V_j$. To obtain a parallel version of this matrix-vector product, we choose to use the map/reduce skeleton [6]. Map/reduce is an easy way to build parallel algorithms from simple associative and commutative operators. A map/reduce algorithm works as follows: (1) the data are distributed among the processing units; (2) the map operator is applied to each piece of data; (3) the reduce operator is used to combine the results; (4) the final result is obtained. To implement the matrix-vector multiplication we define: a map operator which computes the product of a (sub-)matrix and a vector; and a reduce operator which takes i sub-matrices of size (n′, m) and assembles them into an (i × n′, m) matrix. The bsp cost of the bsp algorithm is: Q(i) × Tmap ⊕ Q(i) × g ⊕ Q(i) × Tred ⊕ L, where Q(i) stands for the total amount of data stored at processor i. The multi-bsp cost of the multi-bsp algorithm is: $S(0) \times T_{map} \oplus \bigoplus_{i=1}^{d} \big( S(i-1) \times g_{i-1} \oplus L_{i-1} \oplus S(i) \times T_{red} \big)$, where Tmap (resp. Tred) is the time of the mapping (resp. reducing) and S(i) stands for the total amount of data stored at level i; for example, we have N × M/2/2 elements on each leaf of a dual-core with two threads per core. We assume the following sizes (quantity of memory) of values: SizeOf(float) = 64 bytes and SizeOf(float array) = n × SizeOf(float) if the array contains n elements. We omit small overheads and alternative costs relative to each level for the sake of simplification. Furthermore, the cost of serialising the data is taken into account in the g parameter.
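The map/reduce scheme for the matrix-vector product can be sketched sequentially: the rows are split into blocks (the distribution), each block is multiplied by V locally (the map), and the partial results are concatenated (the reduce). A minimal sketch, with plain lists standing in for the distributed structures of the paper:

```python
# Map/reduce matrix-vector product on row blocks (sequential sketch).
def mat_vec(block, v):
    # map operator: multiply a block of rows by the vector
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in block]

def reduce_blocks(blocks):
    # reduce operator: reassemble the partial results in order
    return [xi for b in blocks for xi in b]

M = [[1, 2], [3, 4], [5, 6], [7, 8]]       # n = 4 lines, m = 2 columns
V = [1, 1]
blocks = [M[:2], M[2:]]                    # distributed over 2 processing units
x = reduce_blocks([mat_vec(b, V) for b in blocks])
```

Concatenation of row blocks is associative, which is what makes the hierarchical (tree-shaped) reduction of multi-bsp correct regardless of how the rows were split.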

4.2 Algorithm Implementation

The bsml codes for mapping/reducing and their descriptions are available in [7,8]. In the context of multi-bsp functional programming, we must now write the map/reduce matrix-vector product algorithm in the multi-ml language. As the multi-ml language uses a tree-based, hierarchical way of executing code,



the map/reduce algorithm is almost embedded in the syntax of the language. Indeed, the map phase consists in mapping a function towards the leaves of the multi-bsp architecture, while the reduce phase is basically the combination of the results towards the root node. In our map/reduce implementation, we assume that the values have been distributed beforehand, so that each leaf already contains its sub-matrices and the nodes are empty. The distribution is thus handled by a tree data structure of matrices. As a matrix is represented in our implementation by a one-dimensional array, the input data has type α array tree. The map multi-function is written in Fig. 6 (left). As expected, we recursively call the multi-function map towards the leaves. When reached, the leaves apply the map operator f on their data stored in tda (the tree of distributed arrays of sub-matrices). Then, we build a tree which contains the results at the leaves.

Fig. 6. Codes of the multi-ml mapping (left) and reducing (right).

After reaching the leaves through the recursive calls, the reduce multi-function simply retrieves the sub-results of its sub-nodes from rc. It transforms the parallel data structure into a local array using to_array and applies the reduce operator to each sub-matrix. Finally, the resulting matrix is used to propagate the result to the root node (Fig. 6, right).

4.3 Performance Predictions

Benchmarks were performed on the following architecture: mirev2, 8 nodes, each with 2 quad-core processors (amd 2376 at 2.3 GHz), 16 GB of memory per node and a 1 Gbit/s network. Based on the computation and communication cost of each phase, it is possible to compute the cost of the proposed algorithm. To do so, we use the multi-bsp parameters, which can be estimated using the probe method [3]. We use the following parameters: g0 = ∞, g1 = 6, g2 = 3 and g0 = 1100, g1 = 1800, g2 = 0 and L0 = 149000, L1 = 1100, L2 = 1800, L3 = 0. For bsp we get g = 1500 and L = 21000. Thanks to a micro-benchmarking library [13] for ocaml, we have estimated the execution time of the main operators used in the map operator: multiplication, reading a value from an array, etc. The timings for each operator are available

Table 1. Operator timings in µs.

TDef = 2.921      TClo = 0.167      TFunApp = 1.505
TLet = 1.312      TVar = 0.619      TSet = 1.778
TGet = 1.324      TFloatAdd = 0.881  TFloatMult = 1.317
TBoolAnd = 0.184  TIntEq = 0.284

in Table 1, where Tmult, Tadd, Tset and Tget respectively stand for multiplication, addition, assignment and read in an array. We have neglected the time needed to build closures (and apply them) for both multi-functions and recursive functions, since most of the computation comes from mapping and reducing. Thus, we have Tmap = 3 × TGet ⊕ TSet ⊕ 2 × TFloatMult ⊕ 3 × TFloatAdd ⊕ 2 × TBoolAnd ⊕ 2 × TIntEq ⊕ 10 × TVar and Tred = TGet ⊕ TSet ⊕ 5 × TVar ⊕ TIntAdd ⊕ TIntEq. As the cost of such atomic operations is prone to significant variation because of compiler optimisations, loop structures and cache mechanisms, we take these costs as "a good approximation" of the average computation time needed by these operations; a more precise approach can be found in [10]. The performance prediction compared to the execution time of the matrix-vector multiplication can be found in Fig. 7. We performed the tests for both bsml and multi-ml. We do not use all the cores, since our current multi-ml implementation needs specific processes to handle nodes (which is not the case for bsml) and we want the cost analysis to be fair. Note that this is a small example, and bsml is sometimes more efficient than multi-ml; a comparison between the two languages on bigger examples is available in [1]. The tests were done for 2 nodes (left) and for 8 nodes (right). We can observe that the performance prediction is consistent with the execution time of the algorithm (and its polynomial complexity). The slopes of the curves are similar, even if not very accurate. This is mainly because our sequential cost model is not fine enough: for example, since this is a toy example, we do not use the cache possibilities of the multi-bsp model, so multi-ml suffers from cache misses that are not currently predicted. The garbage collector of ocaml can also disturb the prediction.
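The computation of Tmap from the operator timings of Table 1 is a direct instance of the linear cost model. The dictionary below transcribes the table values (in microseconds); the formula follows the one given above:

```python
# T_map = 3*TGet + TSet + 2*TFloatMult + 3*TFloatAdd + 2*TBoolAnd
#         + 2*TIntEq + 10*TVar, with timings taken from Table 1 (microseconds).
T = {"get": 1.324, "set": 1.778, "fmult": 1.317, "fadd": 0.881,
     "band": 0.184, "ieq": 0.284, "var": 0.619}

T_map = (3 * T["get"] + T["set"] + 2 * T["fmult"] + 3 * T["fadd"]
         + 2 * T["band"] + 2 * T["ieq"] + 10 * T["var"])
# about 18.15 us per mapped element with these timings
```

Multiplying this per-element time by S(0), the amount of data per leaf, gives the computation term of the multi-bsp cost formula of Sect. 4.1.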

5 Related Work

Close to bsp, the logp models [5] are most of the time used to study network capabilities and low-level libraries such as mpi. Extensions of bsp allowing sub-synchronisation, such as [14], have been proposed. Hierarchical approaches were also proposed in [4]. Parallel algorithmic skeletons are often used to propose cost predictions based on a structured approach, as in [9]. In [12], the shape analysis techniques developed for the fish programming language are used to propose a language with an accurate, portable cost model. Resource Aware ml (raml) [10] automatically and statically computes resource-use bounds for ocaml programs; a version for parallel (multithreaded) and sequential composition has been proposed.



These models are not adapted to our approach, as they do not provide both simplicity and accuracy for hierarchical architectures with a structured execution scheme.

Fig. 7. Performance prediction compared to execution time for bsml and multi-ml, for 2 nodes (left) and 8 nodes (right).

6 Conclusion

Overview of the Work. In this article, we propose a formal semantics with cost annotations allowing the cost prediction of multi-bsp algorithms. We propose a set of rules adapted to a (core) sequential and purely functional version of ml. We then extend this semantics to allow bsp, and then multi-bsp, codes. Thanks to this incremental approach, we obtain a restrained set of rules allowing the cost prediction of multi-bsp algorithms. To show the usability of the cost model embedded in the semantics, we compare the performance prediction with actual benchmarks on several parallel architectures. As our approach is simplified, considers abstract bsp and multi-bsp parameters, and is based on the estimated execution time of atomic operations, it may suffer from accuracy issues. We show that our cost estimation is close to the execution time on a simple map/reduce algorithm applied to a matrix-vector multiplication.

Future Work. An interesting use of this cost semantics is to propose an analysis able to statically infer the cost of a given algorithm. Such an approach is available for programming imperative bsp algorithms [11] and could be extended to functional multi-bsp programming using an approach similar to the one proposed in [10]: it would then be possible to give the cost of a program at compile time.


V. Allombert et al.

References

1. Allombert, V.: Functional Abstraction for Programming Multi-Level Architectures: Formalisation and Implementation. Ph.D. thesis, UPEC (2017)
2. Allombert, V., Gava, F., Tesson, J.: Multi-ML: programming multi-BSP algorithms in ML. J. Parallel Prog. 45(2), 20 (2017)
3. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004)
4. Cha, H., Lee, D.: H-BSP: a hierarchical BSP computation model. J. Supercomput. 18(2), 179–200 (2001)
5. Culler, D., et al.: LogP: towards a realistic model of parallel computation. In: Principles and Practice of Parallel Programming, pp. 1–12. ACM (1993)
6. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
7. Gava, F.: BSP functional programming: examples of a cost based methodology. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008. LNCS, vol. 5101, pp. 375–385. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69384-0_43
8. Gesbert, L., Gava, F., Loulergue, F., Dabrowski, F.: Bulk synchronous parallel ML with exceptions. Future Gener. Comput. Syst. 26(3), 486–490 (2010)
9. Hayashi, Y., Cole, M.: Static performance prediction of skeletal parallel programs. Parallel Algorithms Appl. 17(1), 59–84 (2002)
10. Hoffmann, J., Das, A., Weng, S.C.: Towards automatic resource bound analysis for OCaml. In: Principles of Programming Languages, POPL 2017. ACM (2017)
11. Jakobsson, A.: Automatic cost analysis for imperative BSP programs. Int. J. Parallel Prog. (2018)
12. Jay, C.: Costing parallel programs as a function of shapes. Sci. Comput. Prog. 37(1), 207–224 (2000)
13. Roshan, J., et al.: Core bench: micro-benchmarking library for OCaml (2014)
14. de la Torre, P., Kruskal, C.P.: Submachine locality in the bulk synchronous setting. In: Bougé, L., Fraigniaud, P., Mignotte, A., Robert, Y. (eds.) Euro-Par 1996. LNCS, vol. 1124, pp. 352–358. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0024723
15. Abella, J., et al.: WCET analysis methods: pitfalls and challenges on their trustworthiness. In: IEEE Symposium on Industrial Embedded Systems, pp. 39–48 (2015)
16. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
17. Valiant, L.G.: A bridging model for multi-core computing. J. Comput. Syst. Sci. 77(1), 154–166 (2011)

Exploiting the Table of Energy and Power Leverages

Issam Raïs1, Laurent Lefèvre1(B), Anne-Cécile Orgerie3, and Anne Benoit1,2

1 Laboratoire LIP, École Normale Supérieure de Lyon & Inria, Lyon, France
{issam.rais,laurent.lefevre,anne.benoit}@inria.fr
2 Georgia Institute of Technology, Atlanta, GA, USA
3 Univ. Rennes, Inria, CNRS, IRISA, Rennes, France
[email protected]

Abstract. Large scale distributed systems and supercomputers consume huge amounts of energy. To address this issue, a large set of hardware and software capabilities and techniques (leverages) exists to modify power and energy consumption in large scale systems. Discovering, benchmarking and efficiently exploiting such leverages remains a real challenge for most users. In this paper, we define leverages and the table of leverages, and we propose algorithms and predicates that ease the reading of the table of leverages and extract knowledge from it.

1 Introduction

Data centers worldwide consumed around 194 terawatt-hours (TWh) of electricity in 2014, or about 1% of total demand [2]. This worrying consumption has direct financial and environmental consequences for data center operators, such as Cloud providers and supercomputer managers. Several techniques have been developed to lower the electrical consumption of data centers. These techniques, which we call leverages, can improve the energy efficiency of data centers at different levels: hardware, middleware, and application. Hardware leverages include Dynamic Voltage and Frequency Scaling (DVFS) [11] and shutdown techniques [10]. At the middleware level, energy-efficient resource allocation policies for job managers are examples of leverages [7]. Finally, leverages at the application level include green programming [1]. While many of these leverages have been independently studied in the literature, few works consider the utilization of several leverages at the same time, and never more than two. Yet, the utilization of a given leverage can impact both the utilization and the efficiency of another leverage. The variety of leverages adds to the data center's complexity, in terms of size and hardware heterogeneity, and makes energy efficiency hard to reach for users who have access to multiple leverages. In this work, we aim at extending the current state of the art, which studies the influence of at most one or two leverages at a time, thus ignoring the impacts incurred by the utilization of more leverages. Thus, we

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 175–185, 2018.
https://doi.org/10.1007/978-3-030-05057-3_13


propose a generic definition, combination and knowledge extraction of multiple leverages, in order to fully explore their combined impacts. We propose a first approach toward a completely automated process to characterize the leverages available on a data center node. The key idea of our contribution consists in providing hints to users about the most suitable solution for their application, from a defined score table with a value for each leverage combination and each studied metric. From these tables, knowledge can be derived about leverage combinations and the effects they incur on each other. Beyond the definition of the table of leverages, a tool that helps a user, a developer or an administrator choose which leverage or leverage combination best suits their objectives (here, with a focus on energy and power metrics), the contribution of this paper consists in the algorithms proposed to extract knowledge about the interaction of leverages and their influence on a given metric. The remainder of this paper is structured as follows. Section 2 formalizes the concept of leverages and illustrates this formalism on the leverages under consideration in this paper. Section 3 defines and explains how to build the table of leverages. Section 4 presents the experimental setup and a first full example of the table of leverages. Section 5 then shows how to exploit the raw data of the table of leverages and extract useful knowledge. Finally, Sect. 6 concludes this work and gives perspectives.¹

2 Leverage Definition

In this section, we first propose a formalization of a leverage. Second, we apply this formalism to the leverages that we selected for this paper.

Definition 1. A leverage L is composed of S = {s0, s1, ..., sn}, the set of available valid states of L, and sc, the current state of L.

Thus, an energy or power leverage is a leverage that has a high impact on the energy or power consumption of a device, through its various states or through the modification of its current state. Switching from one state to another can have a cost in terms of time and energy. Yet, in the current work, we focus on studying the impacts of leverage combinations over a single intensive application phase [4], and thus we do not study the switching costs between states. In this paper, we consider multiple leverages available on current hardware, namely multi-thread, computation precision and vectorization. These leverages belong to different categories: application level for the computation precision and vectorization techniques, and middleware level for multi-threading. These leverages are described hereafter.
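As a side illustration (not from the paper), Definition 1 maps directly onto a small data structure: a set of valid states S and a current state sc. A Python sketch with names of our choosing:

```python
from dataclasses import dataclass


@dataclass
class Leverage:
    """Definition 1: a leverage L = (S, sc), a set of valid states S
    and a current state sc."""
    name: str
    states: tuple   # S = {s0, ..., sn}
    current: object # sc

    def switch(self, state):
        # Switching states can cost time and energy; as in the paper's
        # single-phase setting, that cost is ignored here.
        if state not in self.states:
            raise ValueError(f"{state!r} is not a valid state of {self.name}")
        self.current = state


# The three leverages studied in the paper, with their state sets:
nb_threads = Leverage("nbThreads", (1, 32), 1)  # extreme states only
precision = Leverage("Precision", ("int", "float", "double"), "int")
vectorization = Leverage("Vectorization", ("none", "SSE3", "AVX2"), "none")

nb_threads.switch(32)
print(nb_threads.current)  # 32
```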

¹ This work is supported by the ELCI project, a French FSN project that associates academic and industrial partners to design and provide a software environment for very high performance computing. Experiments were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities (https://www.grid5000.fr).


Multi-thread Leverage. The first studied leverage is a middleware-level leverage that enables the use of multiple cores during computation. OpenMP [5], a well-known application programming interface for multi-threading, can be used to exploit this intra-node parallelism of multi-cores. It consists of a set of directives that modify the behavior of the executed code: a master thread forks a specific number of slave threads that run concurrently. This multi-thread leverage increases the CPU utilization of the node. Consequently, because of the non-power-proportionality of current hardware architectures [10], this leverage can improve the energy efficiency of the node. In the rest of the paper, the multi-thread leverage is denoted by nbThreads, with the set of states {1, ..., nmax}, where 1 means that one OpenMP thread is used, and nmax corresponds to the maximum number of threads that can be launched simultaneously on the node. In this work, only the extreme states, 1 and nmax, are explored.

Computation Precision Leverage. The second leverage belongs to the application level and exploits the various computation precision options available on current hardware (e.g., int, float, double). Such a leverage alters the precision of the results computed by the application, but lower precision translates into shorter data representations and thus less computation and less energy consumption. At the application level, the user can specify a desired quality of service, expressed as accessible computation precision states. This precision leverage is denoted by Precision, and its set of states is {int, float, double}, corresponding to the data format used by the application. For each of these states, a different code version is provided.

Vectorization Leverage. Finally, the last studied leverage also concerns the application level. Current CPUs offer vectorization capabilities to exploit intra-core parallelism.
On Intel architectures, it started with the MMX instruction set in the Pentium P5 architecture in 1997 [9]. It was then extended to SSE [6], and SSE was successively extended to SSE2, SSE3, SSSE3 and finally SSE4. AVX [8] then introduced new instructions, followed by AVX2 and finally AVX-512, available in the Xeon Phi architecture. In this paper, we focus on SSE3 and AVX2, which are representative of the SSE and AVX families. These instruction sets permit single-instruction multiple-data (SIMD) processing at the application level. This vectorization leverage is denoted by Vectorization. Its set of states is {none, SSE3, AVX2}, where none means that no vectorization is used. For each of these states, a different code version is provided, using the specific intrinsics and adequate compilation flags for each version. The leverage formalism described above is used in the rest of the paper to easily describe the state of each considered leverage and the possible combinations of leverages. The three leverages studied here are chosen as representative examples of leverages available on modern architectures and frequently used in HPC applications. The methodology proposed in this paper is designed to be applied to any number and any type of leverages.
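As an aside, with the state sets above, the configurations explored by the table of leverages (Sect. 3) form the Cartesian product of the three state sets. A Python sketch (names are ours, not the paper's):

```python
from itertools import product

NB_THREADS = (1, 32)                      # extreme states of nbThreads
PRECISION = ("int", "float", "double")    # states of Precision
VECTORIZATION = ("none", "SSE3", "AVX2")  # states of Vectorization

# One combination = one line of the table of leverages.
combinations = list(product(NB_THREADS, PRECISION, VECTORIZATION))
print(len(combinations))  # 18
```

With two, three and three states respectively, the table has 2 × 3 × 3 = 18 lines, matching Table 1.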

3 The Table of Leverages

We now describe the table of leverages, which relies on metrics and benchmarks to characterize the performance and energy impact of each leverage combination on a given node. For each metric and each benchmark, a score is attributed to a given leverage combination. The table is then used to extract knowledge about each leverage and to evaluate the impacts of leverage combinations, in order to help users utilize their computing infrastructure in a more energy-efficient way.

Metrics. Leverages may influence the quality of service or the performance of an application. For instance, shutdown techniques may induce latency in waking up the required nodes. Consequently, for these leverages, users need to determine their acceptable trade-off between energy-related metrics and performance metrics. The table of leverages relies on three different metrics that represent both energy and performance constraints. These metrics are measured over the time spent during benchmark execution. The first two metrics are energy- and power-related. To define them, we introduce the following notations: T = {t0, ..., tN} is the set of timestamps of the energy consumption measurements of a given run; t0 and tN represent the starting and ending timestamps, respectively (consecutive measurements are one second apart); pj, j ∈ [0, N], represents the power consumption (in Watts) of the considered node at timestamp tj.

Metric 1: The average power consumption of an executable is denoted avrgWatt and is defined as avrgWatt = (Σ_{j∈[0,N]} pj) / (N + 1).

Metric 2: The energy consumption of an executable is denoted Joules. It represents the energy consumption of the complete node between t0 and tN. It is defined as Joules = Σ_{j∈[0,N−1]} (t_{j+1} − t_j) × p_j.

Metric 3: The last metric concerns the performance of the run and is expressed as the execution time, denoted Time. It covers the whole execution time of an executable, including initialization.

Benchmarks. A benchmark corresponds to a self-contained application that is representative of typical applications or portions of applications. The benchmark is compiled before the run and, once launched, the metrics previously defined are collected during its execution. Here, for the sake of clarity, we evaluate only one benchmark for a set of embedded leverages. We chose to focus on a well-known CPU-intensive code: the line per line matrix multiplication (LpL MM) of dense random large square matrices (8192 as dimension size). The same algorithm is implemented for the various leverage combinations. The considered leverages are multi-thread, computation precision and vectorization. For the last two leverages, a different state means a different version of the code, here generated by hand using dedicated intrinsics and compilation flags (-O3 -msse3 -mavx2). We deactivated the auto-vectorization of the compiler (-fno-tree-vectorize) to keep control over the chosen intrinsics, and because automatic generation of vectorized code is not one of the leverages studied in this paper.
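As a side illustration (not from the paper), the three metrics can be computed directly from a per-second power trace. A Python sketch following the definitions of avrgWatt, Joules and Time above (function names are ours):

```python
def avrg_watt(powers):
    """Metric 1: average power, sum(p_j for j in [0, N]) / (N + 1)."""
    return sum(powers) / len(powers)


def joules(timestamps, powers):
    """Metric 2: energy, sum((t_{j+1} - t_j) * p_j for j in [0, N-1])."""
    return sum((timestamps[j + 1] - timestamps[j]) * powers[j]
               for j in range(len(timestamps) - 1))


def exec_time(timestamps):
    """Metric 3: Time, the span between the first and last timestamps."""
    return timestamps[-1] - timestamps[0]


# Toy trace: 100 W, 110 W, 120 W sampled at t = 0, 1, 2 seconds.
ts, ps = [0, 1, 2], [100, 110, 120]
print(avrg_watt(ps), joules(ts, ps), exec_time(ts))  # 110.0 210 2
```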


Formalization of the Table of Leverages. Here, we describe how to compute the score associated with each metric for each leverage combination. Let X, Y, Z be the sets of available states of three leverages χ, ψ, ω (corresponding to S, the set of states for a given leverage L, from Definition 1): X = {x0, ..., x_nx}, Y = {y0, ..., y_ny}, and Z = {z0, ..., z_nz}. Let g1, ..., gm be the measured metric functions, for instance avrgWatt, Joules, and Time. For all u (1 ≤ u ≤ m), gu(xi, yj, zk) is the value of metric gu for the states xi, yj, zk of the leverages χ, ψ, ω. In the table of leverages, each line corresponds to a combination of states of the leverages, and the columns correspond to the measured metrics. We normalize each value by the minimum value for each metric; these normalized values constitute the scores indicated in the table of leverages. Let h1, ..., hm be the normalized versions of g1, ..., gm. Thus, for 1 ≤ u ≤ m,

hu(xi, yj, zk) = gu(xi, yj, zk) / min_{x∈X, y∈Y, z∈Z} gu(x, y, z),

with hu(xi, yj, zk) being the value in the table of leverages in the column of metric u, on the line corresponding to states xi, yj, zk of leverages χ, ψ, ω, respectively. For the application-level leverages, here Precision and Vectorization, the chosen benchmarks correspond to the different combinations of application leverage states; the nbThreads leverage changes its state through an environment variable. When all states are covered, the table of leverages is complete for the considered benchmark. Reducing the creation time of such a table is not the focus of this paper.
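As an illustration only, the normalization hu = gu / min gu amounts to dividing each raw measurement column by its minimum. A Python sketch (the dict representation and names are ours, not the paper's):

```python
def normalize(raw):
    """Score h_u(combo) = g_u(combo) / min over all combinations of g_u,
    so the best combination gets a score of exactly 1."""
    best = min(raw.values())
    return {combo: value / best for combo, value in raw.items()}


# Toy raw execution times (seconds) for three leverage combinations:
raw_time = {(1, "int", "none"): 557.0,
            (32, "int", "SSE3"): 30.0,
            (32, "int", "AVX2"): 9.0}
scores = normalize(raw_time)
print(scores[(32, "int", "AVX2")])  # 1.0
```

Applying this to each metric column independently yields the normalized scores shown in Table 1.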

4 Building and Analyzing the Table of Leverages

In this section, we present the table of leverages built on a node from our experimental testbed, Grid'5000 [3]. Grid'5000 deploys clusters linked with dedicated high-performance networks in several cities in France. As our focus is on energy- and performance-related metrics, we used the Lyon site, where the energy consumption of every computing node is monitored through a dedicated wattmeter, exposing one power measurement per second with an accuracy of 0.125 Watts. The Nova cluster from Lyon is used in the following. This cluster contains Dell PowerEdge R430 servers, each with two E5-2620 v4 CPUs of 8 cores each, 32 GB of memory, and two HDDs of 300 GB each. We applied the previous methodology for the three chosen leverages to the CPU-intensive benchmark. This allows us to explore all possible states of the chosen leverages, and thus to build a complete table of leverages. The table has the following format: the first three columns present the states of the nbThreads, Precision, and Vectorization leverages, respectively, while the last three columns show the normalized results of the three metrics avrgWatt, Joules, and Time, respectively, for every combination of leverages. As can be seen in Table 1 (first six columns), a line represents the results of all gathered metrics for the execution of a representative load for a chosen combination of leverages. The results are normalized as explained before. The table of leverages gathers the knowledge of a Nova node for a given workload, for multiple combined leverage states.


Table 1. Normalized table of leverage states and ranked impact for the line per line matrix multiplication (LpL MM) benchmark on a Nova node. The first three columns give the leverage states, the next three the normalized scores (table of leverages), and the last three the ranked impact of the leverages on each metric.

nbThreads (T) | Prec. (P) | Vector. (V) | avrgWatt | Joules | Time  | avrgWatt | Joules | Time
1             | int       | none        | 1.05     | 65.09  | 61.89 | P,T,V    | P,T,V  | P,T,V
1             | int       | SSE3        | 1.06     | 28.26  | 26.56 | P,V,T    | V,P,T  | V,P,T
1             | int       | AVX2        | 1.06     | 29.32  | 27.67 | P,V,T    | V,P,T  | V,P,T
1             | float     | none        | 1.05     | 72.97  | 69.67 | P,V,T    | P,T,V  | P,T,V
1             | float     | SSE3        | 1.06     | 33.8   | 31.89 | V,P,T    | V,P,T  | V,P,T
1             | float     | AVX2        | 1.05     | 36.8   | 34.89 | P,V,T    | V,P,T  | V,P,T
1             | double    | none        | 1.06     | 81.59  | 76.89 | P,T,V    | P,T,V  | P,T,V
1             | double    | SSE3        | 1.07     | 58.52  | 54.89 | V,P,T    | V,P,T  | V,P,T
1             | double    | AVX2        | 1.06     | 57.72  | 54.22 | P,V,T    | V,P,T  | V,P,T
32            | int       | none        | 1.43     | 13.48  | 9.44  | P,T,V    | T,P,V  | T,P,V
32            | int       | SSE3        | 1.4      | 4.68   | 3.33  | P,V,T    | T,V,P  | T,V,P
32            | int       | AVX2        | 1.0      | 1.0    | 1.0   | P,V,T    | T,V,P  | T,V,P
32            | float     | none        | 1.45     | 7.4    | 5.11  | P,T,V    | T,P,V  | T,P,V
32            | float     | SSE3        | 1.41     | 3.76   | 2.67  | V,P,T    | T,P,V  | T,P,V
32            | float     | AVX2        | 1.56     | 3.11   | 2.0   | P,V,T    | T,V,P  | T,V,P
32            | double    | none        | 1.53     | 8.34   | 5.44  | P,T,V    | T,P,V  | T,P,V
32            | double    | SSE3        | 1.53     | 8.52   | 5.56  | V,T,P    | T,P,V  | T,P,V
32            | double    | AVX2        | 1.54     | 7.0    | 4.56  | P,T,V    | T,V,P  | T,V,P

Explanation of the Table. At first sight, many unexpected results can be detected in Table 1, such as combinations with int being better than float and double when 1 and none are the chosen states of the nbThreads and Vectorization leverages, while this trend is reversed with nbThreads=32.

Among the combinations with 1 as the chosen state of the nbThreads leverage, it is logical that int is quicker than float and then double from a cache-usage perspective. Indeed, more data can be brought into the cache and computed without the need to fetch new data, compared to the float or double representations that need more space for the same number of elements. As for the SSE and AVX combinations, they bring tremendous gains compared to none, as they use the vector capabilities of the core. Using a leverage usually comes with a cost, and this statement is also true for the Vectorization leverage: operations on vectors have a cost, even if it is low. For instance, it is known that loading and saving vectors has a non-null cost. With only one active thread, the current architecture (Broadwell here) allows turbo boost, a technology that permits reaching a much higher frequency than the nominal ones (here it can reach 3.0 GHz, when the average frequency is 2.1 GHz). Also, when the OS detects too much load on a core, it context-switches the running process and runs it on another core: the kernel saves the state (stack, registers) of the current process and loads it on another core, implying a storing and loading cost for the
given process. This phenomenon can happen several times per second. Thus, saving and restoring states can create many cache misses, which can be dramatic when vectorization is used, since loading and saving vectors is not free. As AVX has longer vectors, its vector operations can take longer than SSE ones; thus, AVX starts to be beneficial only for the double combinations of the Vectorization leverage.

When 32 threads are used, data is more likely to be shared between the caches of the various cores. Without the cache struggles of the one-core case, and because floating-point operations (float and double here) are well optimized on current architectures and perform better than integer ones, {32, float, none} and {32, double, none} perform better than {32, int, none}. With all threads sharing data across separate caches, SSE and AVX outperform the none configuration, with AVX always outperforming SSE for a fixed combination. Due to this data repartition between caches implied by the chosen configuration of the nbThreads leverage, there is enough computation to overcome the costs of larger vector operations, here AVX, for all combinations. Note that the best combination for all metrics used here is always {32, int, AVX2}; it is the combination to choose only if there are no constraints on leverage choices.

Variations are expected, as leverages highly modulate the usage of the node: intensity of cache usage, core usage, and availability of specific mechanisms (as seen with turbo boost with one thread). The metric results of a combination of leverages are thus complicated to fully understand without a detailed knowledge of the architecture, the underlying leverages and their influences in a given context. We therefore propose predicates that help a user underline such points of interest from the table of leverages. For example, the table can help a user choose a combination given a fixed leverage state, or answer the following question: is there a leverage, or a state of a leverage, that is always better for a given metric?

5 Exploiting the Table of Leverages

In this section, we describe the main contribution of this paper: a methodology to exploit the table of leverages and to extract useful knowledge, such as the influence and impact of one or several leverages on a given metric or set of metrics. We propose two focuses for extracting a score for each leverage. The first one corresponds to the actual table: it normalizes the results of a given metric for every explored configuration. The second one computes a contribution ratio for each leverage, in order to expose the most relevant leverage (the one with the largest contribution to the considered metric). We define four exploitation predicates that ease the analysis of the table and answer the following questions, which we illustrate on the selected table (Table 1). These questions target a single metric, hu.

Question 1: Is a selected combination of leverage states the best one for metric hu? If a given combination is always the best, it should always be applied, if possible, when optimizing hu. Consider a combination of states xa, yb, zc of leverages χ, ψ, ω for metric hu. We need to check whether, for all i ∈ [0,...,nx]\{a}, j ∈ [0,...,ny]\{b}, and k ∈ [0,...,nz]\{c}, we have hu(xa, yb, zc) ≤ hu(xi, yj, zk). On Nova nodes and for the three leverages (Table 1), the best combination for all three studied metrics is {32, int, AVX2}.

Question 2: When I fix a state, do I always improve metric hu? Consider state xa of leverage χ. We want to check whether, for all i ∈ [0,...,nx]\{a}, all l, j ∈ [0,...,ny], and all m, k ∈ [0,...,nz], we have hu(xa, yl, zm) ≤ hu(xi, yj, zk). In the example of Table 1, for the Joules and Time metrics, only the nmax state (here, 32) of the nbThreads leverage satisfies this predicate, meaning that using this state is always beneficial. No specific result is obtained for the avrgWatt metric, meaning that no leverage state is always better for this metric when used.

Question 3: If some states are fixed for a subset of leverages, is a given state for the remaining leverages the best choice to optimize hu? Consider that the states of leverages ψ, ω are fixed to yb, zc. We ask whether state xa of leverage χ is the best choice for metric hu. Therefore, we need to check whether, for all i ∈ [0,...,nx]\{a}, we have hu(xa, yb, zc) ≤ hu(xi, yb, zc). This tells us, for instance, that for the fixed combination {32, SSE3}, the best state for the Precision leverage is float when considering the Joules or Time metric (Table 1). However, when focusing on avrgWatt, for the same fixed combination {32, SSE3}, the best state for the Precision leverage is int. If only state zc of leverage ω is fixed, and we consider states xa and yb of leverages χ and ψ respectively, we check whether, for all i ∈ [0,...,nx] and all j ∈ [0,...,ny], we have hu(xa, yb, zc) ≤ hu(xi, yj, zc). Concerning the Joules metric (Table 1), for the fixed state float of the Precision leverage, the best combination for the nbThreads and Vectorization leverages is {32, AVX2}. However, for the avrgWatt metric, again with the float state of the Precision leverage fixed, the best combination is now {32, SSE3}.

Applying this predicate allows us to extract some unexpected results. Concerning the Joules and Time metrics, for the Precision and Vectorization leverages, no state emerges as the best one: it highly depends on the chosen states of the other leverages. One could, for instance, expect int to always be the best state, but when comparing {32, double, none} with {32, int, none}, we see that the double combination is more effective than the int one. Similar conclusions can be drawn for the Vectorization leverage. AVX2 has larger vectors than SSE3, so we would expect it to always be more efficient. However, when the nbThreads state equals 1, {1, float, SSE3} is more effective than {1, float, AVX2}, leading to a different best choice when combined with the nmax state (here, 32), where {32, float, AVX2} is more effective than {32, float, SSE3}. Note that the latter combination nevertheless emerges as the best one when SSE3 is fixed. Concerning the avrgWatt metric, we also obtain unexpected knowledge. In opposition to the Joules and Time metrics, no state emerges as the best one for any of the studied leverages. As AVX2 has larger vectors than SSE3, we would expect it to always stress the CPU more, thus always yielding higher values for
this metric. This is indeed the case with the {32, float} and {32, double} combinations, but it is not observed with the other combinations. When nbThreads=1, int is always the best choice to minimize this metric, whatever the chosen states of the Precision and Vectorization leverages. Moreover, whatever the studied states of the Vectorization and nbThreads leverages, int is always the best choice to minimize the avrgWatt metric.

Question 4: Given a combination for all the leverages, how can we rank the states in terms of contribution for metric hu? To answer this question, we consider a set of states xa, yb, zc of leverages χ, ψ, ω. Then, for each state w ∈ {xa, yb, zc}, we compute the contribution score mc(w) of this state on metric hu as follows. For state xa of leverage χ, mc(xa) = hu(xa, yb, zc) / max_{i∈[0,...,nx]} hu(xi, yb, zc).

We define similarly the contribution scores of the states of the other leverages ψ and ω. Then, we rank the contribution scores mc(xa), mc(yb), mc(zc) in ascending order to answer the question. Table 1 (last three columns) presents this scoring for the table of leverages. For the best combination {32, int, AVX2}, the ranking for the Joules metric is "T,V,P" (for "nbThreads, Vectorization, Precision"), meaning that the chosen state of T is the most contributing one in this combination, followed by the V and then P states. Thus, for this combination, the Precision leverage in the int state has the lowest contribution. This ranking points out unexpected results for the Joules metric. We notice a switch between two positions of a given leverage for the fixed combination of the other leverage states {32, double}: when comparing the scoring of {32, double, SSE3} with {32, double, AVX2}, we get "T,P,V" and "T,V,P", respectively. In the first case, double and SSE3 have the same worst possible score, 1.0, meaning that each is the worst state of its leverage for this combination. In the second case, AVX2 scores better than SSE3, and thus it is ranked above double. When nbThreads=1, we note that combinations including the SSE3 and AVX2 states always have the Vectorization leverage state as the most contributing one, which leads to the conclusion that it is always better to use the SSE3 or AVX2 state for the Vectorization leverage. For the {32, float, SSE3} combination, we get the scoring "T,P,V": float gets a better score, and thus a better position, than SSE3 because it is the best leverage state for the {32, SSE3} combination, leading to the conclusion that choosing float instead of the other Precision states contributes more than choosing SSE3 instead of the other Vectorization states for this combination.

For the avrgWatt metric, the scoring underlines the fact that when int is chosen as the state of the Precision leverage, for a fixed state of the Vectorization leverage, the ordering is always the same: {32, int, none}, {32, int, SSE3} and {32, int, AVX2} get exactly the same contribution ordering as {1, int, none}, {1, int, SSE3} and {1, int, AVX2}, respectively. Moreover, int is always the most contributing leverage state, which shows that int is always a good choice to improve this metric. This scoring also underlines the fact that, in order to minimize the avrgWatt metric, a user should rather focus on the P and V leverages, as T is never the most contributing one. This scoring
184

I. Ra¨ıs et al.

highlights results that would have been diﬃcult to notice just by looking at the table. It allows a user to quantify how much a leverage position used in a combination contributes to the overall performance for a given metric.

6 Conclusion

Energy efficiency is a growing concern. In the context of HPC and datacenters, where the size of infrastructures grows drastically, energy consumption has to be taken into account as a major expense. There is a wide range of techniques, which we formally define as leverages, that permit modulating the computing capabilities and/or the energy/power used by a device. We propose a generic solution to extract fine-grained knowledge and hints from the table of leverages, thanks to the defined predicates. Our solution underlines new knowledge about leverages alone and about combinations of leverages. Thus, it allows us to extract the influences of leverages on each other, as knowledge understandable by the user.

Knowledge could be extracted from a table built on a CPU-intensive workload. For example, our solution underlines the fact that if Precision is set to the double state, it is always better to use it with the AVX2 state of the Vectorization leverage to minimize the Joules metric. Also, with Vectorization fixed to the SSE3 state, our solution tells us that float is the best state to minimize the Joules metric. We also underline the fact that some unexpected behavior can be seen when combining leverages. For example, changing float or int to double for Precision, while keeping the SSE3 state activated for the Vectorization leverage, turns out to be counterproductive for the Joules metric.

A first short-term future work is the parallelization of the creation of the table of leverages, in order to reduce the time needed to build it. Then, we plan to apply this methodology to non-CPU-intensive phases, such as IO-, HDD- and RAM-intensive phases, with appropriate leverages for every phase. Finally, a future working direction would be to extend this methodology to leverages with costly state transitions, such as shutdown policies. Also, we would like to investigate how to reduce the completion time for building such a table. In fact, the time to solution here could be greatly reduced, for example by predicting which runs are not needed to obtain the values of the relevant metrics, using learning or prediction techniques.


Exploiting the Table of Energy and Power Leverages



A Semantic Web Based Intelligent IoT Model

Chao Qu, Ming Tao(✉), Jie Zhang, Xiaoyu Hong, and Ruifen Yuan

School of Computer Science and Network Security, Dongguan University of Technology, Dongguan 523808, China
{quc,zhangjie,hongxy,yuanrf}@dgut.edu.cn, [email protected]

Abstract. Different from a sensor network, the devices in the intelligent Internet of Things (IoT) should be able to organize and coordinate spontaneously to accomplish specific tasks. Taking advantage of various intelligent technologies, we propose an intelligent IoT model based on the Semantic Web. The framework consists of a top ontology, an entity link layer, a semantic annotation layer, a service registry center, a transaction construction layer, and a transaction execution control layer. For the sake of constructing and executing transactions automatically in the intelligent IoT, entity functions are represented by Semantic Web Services. Additionally, the framework also acts as a manager during the execution of a transaction and provides effective management and control of the entities. We demonstrate the effectiveness and superiority of the proposed model with a case study of a comprehensive rescue service for transportation accidents.

Keywords: Intelligent IoT · Semantic Web · IoT framework

1 Introduction

The ultimate purpose of the IoT is to realize smart interconnection between objects, and many applications have been deployed [1]. The logical expression ability and the knowledge discovery and reasoning capabilities of the Semantic Web match the needs of the further development of the Internet of Things, and the Semantic Web has become an important technology for promoting it. In the past few decades, semantic techniques have provided means for the description, information sharing, and integration of heterogeneous objects. Moreover, artificial intelligence and knowledge engineering are combined in the field of the Semantic Web to represent and process data and knowledge.

2 Related Work

© Springer Nature Switzerland AG 2018. J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 186–195, 2018. https://doi.org/10.1007/978-3-030-05057-3_14

The corresponding research results on the Intelligent IoT include the Smart-M3 system, which aims to merge the Semantic Web and IoT domains; the system provides a semantic publishing and subscription software architecture [2]. Tao et al. [3] used a semantic ontology to manage devices in smart homes. Wu et al. [4] proposed a unified knowledge framework to improve the Semantic Web of Things (SWoT) for interoperability between IoT applications in specific areas. Jahan et al. [5] discussed a ubiquitous knowledge base, a CoAP-based framework and an intelligent gateway for the SWoT framework. Gyrard et al. [6] proposed SEG 3.0 as a joint, unified approach to semantic interoperability and applied it to smart city systems. Poslad et al. [7] proposed a new semantic IoT early warning system framework, and a semantic information organization model was proposed to explore the scalability of the IoT [8]. Singh et al. [9] proposed a new IoT architecture model using a Semantic Fusion Model (SFM), which uses an intelligent semantic framework to encapsulate information gathered from sensor networks. One of the most representative works is the Semantic Web of Things model proposed in [10].

3 Problem Statement

The purpose of the Intelligent IoT is to realize a direct correlation between the information space and the physical world and to achieve complete intelligent interaction. Since the process is completed without human involvement, the problems are more complex and diverse. The intelligent interaction between human society, the information space and the physical world (Fig. 1) is different from both the IoT and the Semantic Web: the complexity increases geometrically, and the system faces unprecedented problems and challenges. The problems include the following four aspects.

First, the existing frameworks for the Intelligent IoT mainly focus on information processing and do not substantially involve the driving and control of entities. In essence, the existing frameworks still restrict their research to the information field, and many issues concerning device connection are not addressed.

Second, the existing frameworks for the Intelligent IoT do not completely separate from manual control, and so do not achieve true intelligence. The ultimate purpose of the Intelligent IoT is to hand over all information processing to the machines or entities; people only act as perceivers of the end result and need not participate in querying, composition and processing. The existing frameworks cannot provide solutions for such intelligent development.

Third, the existing frameworks for the Intelligent IoT cannot support the construction of complex processes. Although they can settle service-composition problems, they do not support service composition coupled with dynamic entity information.

Fourth, the existing frameworks for the Intelligent IoT cannot manage and control the execution of complex processes. Physical-world entities need an effective scheduling mechanism, as well as a mechanism to resolve errors: in some conditions, the system must resolve errors immediately and eliminate their expected impact.

Fig. 1. Interactions in IoT, SW and SWoT.

4 Semantic Web Based Intelligent IoT Model

To settle the problems in Sect. 3, we propose a Semantic Web based intelligent IoT model (ISWM), which this section explains in detail.

4.1 Framework

Based on previous research, we propose a more reasonable Intelligent IoT model, as shown in Fig. 2.

Fig. 2. Framework of Semantic Web based intelligent IoT model.


Top Ontology. The top ontology is used to represent the concepts and relationships involved in the IoT. It is the basis of the framework and provides logical reasoning for the whole architecture [11].

Entity Link Layer. The entity link layer implements functions such as information transfer and drive control for the various entities; it is the intermediate layer between the logic part and the entities. This layer must not only discover newly added physical devices, convert the various device communication protocols and drive the devices, but also acquire the device status information in real time to provide the basis for the establishment and execution of transactions.

Semantic Annotation Layer. The semantic annotation layer implements the semantic annotation of the original data in order to provide semantic support for the upper applications. This layer also packages the semantic information of entities into Semantic Web Services. It is divided into three functional modules. The semantic tag database is a set of semantic representation tags for existing entities, similar to the DTD in XML. The semantic annotation module labels the functions and features of the underlying entities with normalized tags, providing a basis for selection during transaction construction. The service package module encapsulates the semantically labeled entity functions into Web services.

Service Registry Center. In the semantic annotation layer, the entity information is converted into a machine-readable semantic format. This information includes two parts: static attributes, such as state and external-environment description, and functional attributes, which are packaged as Web Services. In an approach similar to that of Web services, this information is stored in a registry or on a cloud platform and can be updated using periodic queries or transaction triggers.
The service registry can use existing SOA technologies to store and manage the services provided by the IoT entities. The service registry center can adopt a centralized or a distributed management mode; its working principle and implementation technology can directly draw on UDDI.

Transaction Construction Layer. This layer regards the Web Services as dynamically configurable resources for management and scheduling. The main function of this layer is to build the service chain that meets the user's requirements. Transaction construction takes two steps: semantic decomposition of the user requirements, and service discovery and composition. This layer includes the following four functional modules. The context analysis module analyzes the semantic information delivered by the user interface through ontology reasoning and a priori rules, and then makes a definitive judgment on the requirement according to the corresponding context. The requirement decomposition module uses the ontology and its inference rules to decompose the request into sequential calls of several entities, which are translated into Web services. The function of the service query module is to find the Web services that satisfy the conditions in the registry center according to the result of the requirement decomposition.


The function of the service composition module is to organize the queried services and build a transaction that meets the users' requirements.

Transaction Execution Control Layer. The main function of this layer is to manage and control the execution of IoT transactions, including: the definition of entity states in the transaction; state awareness of subsequent entities during execution; dynamically replacing an entity or terminating the transaction when an error occurs; and cooperative scheduling or process consolidation between cooperating IoT transactions. This layer includes the following four functional modules. The transaction status set is a predefined rule set which defines the states, environment requirements, etc., that must be fulfilled during the execution of the IoT transaction; it must be determined in advance, using the ontology reasoning rules on the semantic information, while constructing the IoT transaction. The entity status query module is directly associated with the device state module in the entity link layer and obtains the state information of the entities in real time. When the error control module encounters, during the execution of the IoT transaction, a failure of a service that represents an entity's function, it determines whether the pre-driver result is retained and how the successor sequence is handled, according to the semantic environment. The scheduling module controls concurrent, cooperative or mutually exclusive IoT transactions, and schedules non-concurrent entities according to the actual environment.

4.2 Working Mechanism

The workflow of ISWM includes the following three aspects.

Entity Functions are Registered as Web Services. First, the entities connected to the IoT are captured by the device discovery module at the entity link layer. Second, the entities are matched with their drivers, which are configured by the device driver module, and the entity information is passed to the semantic annotation layer for semantic encapsulation. After that, the semantic annotation module represents the entities according to the metadata in the semantic tag database, and the service package module expresses the entity functions as Web Services. Finally, the services are registered in the service registry center and published by the service publisher module. The process is shown in Fig. 3.

Construct IoT Transactions to Meet the User's Requirement. The user requirement is provided to the system in natural language and passed to the transaction construction layer. The requirement is analyzed by the context analysis module and translated into machine-readable semantic information. The semantic information is decomposed into a combination of simple requirements by the requirement decomposition module; the format of these simple requirements is the same as that of Web Services. The service query module finds and matches the entity services in the service registry center. The service composition module organizes them to construct the transaction and establishes the transaction status set. The process is shown in Fig. 4.


Fig. 3. The process of entities registered as Web services.
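The registration and lookup roles of the service registry center described above can be sketched as follows. This is a hypothetical API, not the paper's implementation: the class, method names and tag-based matching are illustrative stand-ins for the ontology-driven semantic annotation and UDDI-style registry.

```python
# Minimal sketch of a service registry center (hypothetical API): entity
# functions are registered as services with semantic tags and a status, and
# can be queried by required capabilities, in the spirit of UDDI-style lookup.

class ServiceRegistry:
    def __init__(self):
        self._services = {}  # name -> record

    def register(self, name, tags, status="available"):
        """Register an entity function exposed as a Web Service."""
        self._services[name] = {"tags": set(tags), "status": status}

    def update_status(self, name, status):
        """Periodic queries or transaction triggers update the stored status."""
        self._services[name]["status"] = status

    def query(self, required_tags):
        """Find available services whose semantic tags cover the required capabilities."""
        required = set(required_tags)
        return [name for name, rec in self._services.items()
                if required <= rec["tags"] and rec["status"] == "available"]

registry = ServiceRegistry()
registry.register("wrecker-1", ["vehicle", "towing"])
registry.register("ambulance-3", ["vehicle", "medical"])
print(registry.query(["vehicle", "towing"]))  # prints ['wrecker-1']
```

A real registry would store the full semantic description of each service; the tag sets here merely stand in for that ontology-based matching.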

Fig. 4. IoT transaction construction process.
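The construction flow of Fig. 4 — requirement decomposition followed by service query and composition — can be sketched as below. The decomposition into capability tags, the catalog, and all names are illustrative assumptions standing in for the ontology-based context analysis and requirement decomposition modules.

```python
# Hypothetical sketch of transaction construction: a user requirement is
# decomposed into simple sub-requirements (here, lists of capability tags),
# each is matched against a registered service, and the matches are composed
# into an ordered service chain.

def build_transaction(sub_requirements, service_catalog):
    """Map each decomposed sub-requirement to a matching service, in order."""
    chain = []
    for required in sub_requirements:
        matches = [name for name, tags in service_catalog.items()
                   if set(required) <= set(tags)]
        if not matches:
            raise LookupError(f"no service satisfies {required}")
        chain.append(matches[0])  # a real composer would rank the candidates
    return chain

catalog = {
    "report-service": ["report"],
    "wrecker-service": ["tow", "vehicle"],
    "ambulance-service": ["medical", "vehicle"],
}
requirement = [["report"], ["tow"], ["medical"]]  # decomposed user requirement
print(build_transaction(requirement, catalog))
# prints ['report-service', 'wrecker-service', 'ambulance-service']
```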

Control the Execution of the IoT Transaction. After the construction of the IoT transaction, the services are called in turn according to their logical sequence, under the control of the transaction execution control layer. In the service invocation process, the entity status query module detects the state of the entity providing the service in real time, through the device state module in the entity link layer, and updates the transaction status set. At the same time, the semantic parameters in the requirements are transferred as driver parameters to the corresponding entities by the information convert module. During the entire transaction execution process, the error control module and the scheduling module in the transaction execution control layer are responsible for the processing and management of errors. The process is shown in Fig. 5.


Fig. 5. IoT transaction execution process.
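The control loop of Fig. 5 can be sketched as follows. All interfaces (`entity_ok`, `invoke`, `reschedule`) are hypothetical stand-ins for the entity status query, device driver and scheduling modules; the sketch only illustrates the check-schedule-or-abort structure of the loop.

```python
# Illustrative sketch of the execution control loop of Fig. 5 (hypothetical
# interfaces): services are invoked in order; before each call the entity's
# state is checked, and on an entity error the scheduler tries to substitute
# another entity before the error-control path aborts the transaction.

def execute_transaction(chain, entity_ok, invoke, reschedule):
    """Run a service chain under simple state checking and error control."""
    for service in chain:
        if not entity_ok(service):
            substitute = reschedule(service)   # scheduling module
            if substitute is None:
                return False                    # error control: terminate
            service = substitute
        invoke(service)                         # drive the underlying entity
    return True

log = []
ok = execute_transaction(
    chain=["sense", "tow", "notify"],
    entity_ok=lambda s: s != "tow",             # simulate a failed wrecker
    invoke=log.append,
    reschedule=lambda s: "tow-backup" if s == "tow" else None,
)
print(ok, log)  # prints: True ['sense', 'tow-backup', 'notify']
```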

5 Comparison and Use Cases

5.1 Comparison of Models

The ISWM proposed in this paper is compared with the SWoT model [9] and the active service model for IoT (IASM) [10]. The common attributes are as follows: all of them need a knowledge base (KB) as support, and in IASM and ISWM, Web Services are used as the objects of discovery and composition, with service composition used for managing information. The differences between our model and the previous architectures are as follows. The ISWM framework adds an entity link layer to the aforementioned structure, which is used to communicate with and control the underlying entities; in ISWM, information is transmitted not only upward but also downward. A service registry center is added for the unified and standardized management of entity services. In particular, entities can be scheduled to resolve unexpected problems when abnormal conditions occur. The differences are listed in Table 1.

Table 1. Comparison of SWoT, IASM and ISWM

      | KB support | Service discovery | Entity composition | Entity status feedback | Entity control and schedule
SWoT  | Yes        | No                | No                 | No                     | No
IASM  | Yes        | Yes               | Yes                | No                     | No


Table 2. The implementation of the transport rescue process in different environments

                        | Manual operation                         | SWoT                                     | ISWM
Information collection  | Rescuers                                 | Sensor network                           | Sensor network
Accident determination  | Traffic police                           | KB                                       | KB and ontology
Decision making         | Emergency rescue department              | Decision Support Systems                 | Decision Support Systems
Organization            | Emergency rescue department and rescuers | Emergency rescue department and rescuers | Transaction execution and control system and rescuers
Information publication | Emergency rescue department              | Emergency rescue department or KB        | Emergency rescue department

Table 3. Executioner and method in the rescue process in different environments

                    | SWoT: Executioner           | SWoT: Method                            | ISWM: Executioner                                                       | ISWM: Method
Information report  | Emergency rescue department | By phone                                | Transaction construction system                                         | Message trigger
Wrecker dispatch    | Emergency rescue department | Query and scheduling manually           | Transaction execution control system                                    | Policy scheduling, message triggering
Ambulance dispatch  | Emergency rescue department | Query and scheduling manually           | Transaction execution control system                                    | Policy scheduling, message triggering
Fire truck dispatch | Emergency rescue department | Query and scheduling manually           | Transaction execution control system                                    | Policy scheduling, message triggering
Traffic control     | Traffic police              | Command by traffic police               | Transaction execution control system                                    | IoT entities such as traffic lights and indicators; traffic police if needed
Material dispatch   | Emergency rescue department | Preparation and transportation manually | Transaction construction system and transaction execution control system | Intelligent storage and intelligent logistics system scheduling

5.2 Case Study of Traffic Rescue

The Intelligent Transportation System (ITS) utilizes IoT technology to equip the road network for real-time monitoring and exact management. Its most important function is to detect and deal with traffic accidents in time. The model proposed in this paper is compared with the manual operation and the SWoT model in the implementation of the comprehensive rescue process for traffic accidents, as shown in Table 2. We can see that in the comprehensive rescue process of traffic accidents, the proposed model is as effective as the manual operation and the SWoT structure. The biggest difference between our model and the other two structures lies in the executioners of the integrated rescue procedure, as shown in Table 3. It can be seen from Tables 2 and 3 that in the comprehensive rescue process, the SWoT model can use the sensor network and knowledge system to discover and determine accidents and to formulate rescue strategies, but it is ineffective for the organization of the subsequent rescue work.

6 Conclusion

In order to achieve the intelligence objectives, this paper proposed an intelligent IoT model based on the Semantic Web, and we described the framework and working mechanism of the model. The framework uses an ontology as the logical reasoning basis and is divided into several parts: the entity link layer, the semantic annotation layer, the service registry center, the transaction construction layer, and the transaction execution control layer. Semantic technology is used to describe each IoT entity as a dynamic Web Service. In the model, the technologies of service discovery and service composition are used to build IoT transactions that meet users' requirements and to control the transaction processes. Due to the addition of physical feedback, entity control and scheduling, the advances of our model are shown in the use case of traffic accident rescue. In another work, we study the security of the model [12].

Acknowledgment. This work was supported in part by the Natural Science Foundation of Guangdong Province, China (Grant No. 2018A030313014); the Guangdong University Scientific Innovation Project (Grant No. 2017KTSCX178); the outstanding young teacher training program of the Education Department of Guangdong Province (Grant No. YQ2015158); Guangdong Provincial Science & Technology Plan Projects (Grant Nos. 2016A010101035 & 2016A010101034); and the National Natural Science Fund, China (Grant Nos. 61300198 & 61772233).

References

1. Tao, M., Zuo, J., Liu, Z., Castiglione, A., Palmieri, F.: Multi-layer cloud architectural model and ontology-based security service framework for IoT-based smart homes. Futur. Gener. Comput. Syst. 78, 1040–1051 (2016)
2. D'elia, A., Viola, F., Roffia, L., Azzoni, P., Cinotti, T.S.: Enabling interoperability in the Internet of Things: a OSGi semantic information broker implementation. Int. J. Semant. Web Inf. Syst. 13(1), 147–167 (2017)


3. Tao, M., Ota, K., Dong, M.: Ontology-based data semantic management and application in IoT- and cloud-enabled smart homes. Futur. Gener. Comput. Syst. 76, 528–539 (2016)
4. Wu, Z., Xu, Y., Zhang, C., Yang, Y., Ji, Y.: Towards semantic web of things: from manual to semi-automatic semantic annotation on web of things. In: Wang, Y., Yu, G., Zhang, Y., Han, Z., Wang, G. (eds.) BigCom 2016. LNCS, vol. 9784, pp. 295–308. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42553-5_25
5. Jahan, F., Fruitwala, P., Vyas, T.: Towards the next generation of web of things: a survey on semantic web of things' framework. In: Satapathy, S.C., Das, S. (eds.) Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Volume 1. SIST, vol. 50, pp. 31–39. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30933-0_4
6. Gyrard, A., Serrano, M.: Connected smart cities: interoperability with SEG 3.0 for the Internet of Things. In: IEEE 30th International Conference on Advanced Information Networking and Applications Workshops, pp. 796–802 (2016)
7. Poslad, S., Middleton, S.E., Chaves, F., et al.: A semantic IoT early warning system for natural environment crisis management. IEEE Trans. Emerg. Top. Comput. 3(2), 246–257 (2015)
8. Sun, Y., Jara, A.J.: An extensible and active semantic model of information organizing for the Internet of Things. Pers. Ubiquitous Comput. 18(8), 1821–1833 (2014)
9. Singh, D., Tripathi, G., Jara, A.J., et al.: A survey of Internet-of-Things: future vision, architecture, challenges and services. In: 2014 IEEE World Forum on Internet of Things, pp. 287–292 (2014)
10. Scioscia, F., Ruta, M.: Building a semantic Web of things: issues and perspectives in information compression. In: Proceedings of the 2009 IEEE International Conference on Semantic Computing (ICSC), pp. 589–594 (2009)
11. Qu, C., Liu, F., Tao, M.: Ontologies for the transactions on IoT. Int. J. Distrib. Sens. Netw. 11, 1–12 (2015)
12. Qu, C., Tao, M., Zhang, J., Hong, X.Y., Yuan, R.F.: Blockchain based credibility verification method for IoT entities. Secur. Commun. Netw. 2018, 1–11 (2018)

Accelerating CNNs Using Optimized Scheduling Strategy

Rui Xu1, Sheng Ma2(✉), Wenwu Li1, and Yang Guo1

1 College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China
2 The State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, Hunan, China
[email protected]

Abstract. Convolutional neural networks (CNNs) have a wide range of applications in image and video recognition, recommender systems and natural language processing. But CNNs are computationally intensive, and their computational cost is often hard to accept. To speed up the computation, people focus on optimizing convolution, which accounts for most of CNNs' operations, and many algorithms have been proposed to accelerate the convolution layers. However, each algorithm has its advantages and disadvantages, and no single algorithm can handle all situations. In this paper, we examine the performance of various algorithms in a GPU environment. By building a customized CNN model, we fully explore the impact of the network structure on the performance of the algorithms, including inference/training speed and memory consumption. In addition to the algorithms themselves, we also study how their implementations in a GPU environment affect their performance. Finally, we summarize the characteristics of each algorithm and design a strategy that assigns the appropriate implementation to each convolutional layer of a CNN. With our strategy, AlexNet runs 1.2x to 2.8x faster than with other strategies in a GPU environment. This work is important for understanding these algorithms and may provide insights for further optimizations of the architecture of GPUs and accelerators.

Keywords: Artificial intelligence · Convolutional neural networks · Scheduling strategy · GPU framework

1 Introduction

Since deep learning [1] was proposed, it has rapidly become a hot topic. In particular, deep neural networks (DNNs) have made significant progress in image classification, target recognition, speech recognition, language translation, etc. [2]. In some cases, the accuracy of neural networks even exceeds that of human identification [3].

This work is supported by the National Natural Science Foundation of China (No. 61672526) and Research Project of NUDT (ZK17-03-06). © Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 196–208, 2018. https://doi.org/10.1007/978-3-030-05057-3_15


A series of successful and mature network models have also been proposed, including Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs) [2], etc. In this paper, we focus on CNNs, which play a key role in image and video recognition, recommender systems and natural language processing [3].

However, the training/inference time of CNNs is long and sometimes unbearable. Due to the complexity of convolution operations, CNNs bring a huge workload to the device. Meanwhile, CNNs are becoming more and more complicated, since the number of convolutional layers keeps increasing. These changes improve accuracy, but result in a huge increase in training or inference time. There are many ways to address this problem, one of which is using acceleration algorithms. At present, there are three popular convolution acceleration algorithms: matrix multiplication (GEMM) [5], Fast Fourier Transform (FFT) [6], and Winograd's minimal filtering algorithm (the Winograd algorithm) [7]. GEMM converts convolution operations into more efficient matrix operations, while the FFT and Winograd algorithms reduce the computational complexity of CNNs.

People usually implement CNNs in a GPU environment, because GPUs use many-core architectures and have massively parallel processing power [2]. Moreover, NVIDIA has developed a deep learning library called cuDNN, a GPU-accelerated library of primitives for deep neural networks [8], which provides highly tuned implementations of convolution to accelerate the execution of CNNs on GPUs. Therefore, more and more users choose GPUs to speed up the execution of CNNs, and cuDNN is being used more and more widely. Currently, users can choose an appropriate framework to build a CNN model, but they rarely understand the implementations or algorithms of convolution used by these frameworks.
Few studies have really shown the differences between these convolution algorithms. In this paper, we show the detailed comparisons on the characteristics between these algorithms. We choose GPUs as the main hardware platform for convolution operation and compare seven most popular implementations of convolution. We choose a customized CNN model as the workload to obtain performance characteristics of these implementations. The customized CNN model was built and trained using the same framework, Caffe [9]. Our work shows that each implementation has pros and cons, and no algorithm can perform best in all situations. The actual performance of these algorithms or implementations will be heavily dependent on the conﬁguration of convolutional layer. Moreover, in the same conﬁguration, the GPU implementation will further affect the performance of these algorithms. This work has very important meaning for understanding these algorithms and may provide insights for further optimizations of the architecture of GPUs and accelerators. Based on the characteristics of algorithm, we provide optimization techniques to implement efﬁcient CNN models using the algorithm libraries. We design an Optimized Algorithm Scheduling Strategy, which assign the appropriate convolution algorithm for each convolutional layer of CNNs. We also designed an experiment to verify the superiority of our strategy. We compare our design with several existing

198

R. Xu et al.

solutions, such as Caffe+CUDA and Caffe+cuDNN. Our experimental results show that our strategy increases execution speed by up to 2.8x compared to Caffe+CUDA and 1.2x compared to Caffe+cuDNN.

2 Background and Related Work

2.1 Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a feedforward neural network [2]. It shows excellent performance in large-scale image processing and identification. In a CNN, multiple convolutional layers are connected; such a structure allows the CNN to abstract the features of the image as much as possible. The main purpose of the convolutional layers is to extract the features in the image. They use well-trained filters that are highly responsive to specific patterns. However, the types of features extracted by different convolutional layers are not the same. In AlexNet, the first convolutional layer detects low-order features such as edges, corners, and curves. As the number of convolutional layers increases, the features detected by the filters become more complex [10].


Fig. 1. A description of part of AlexNet, from Layer 2 to Layer 3, using 4D tensors.

In a CNN, feature data is stored as tensors. In the traditional method, feature images or maps are processed in two dimensions, ⟨H, W⟩, where H represents the height of the images and W the width. But since many images need to be processed in the same layer, we can treat the feature map data as a four-dimensional tensor, ⟨N, C, H, W⟩, where N and C denote the number of images in a batch and the number of channels, respectively. In this way, we can easily describe a CNN's network structure (see Fig. 1). Similarly, we can use a 4D tensor to describe the kernels, ⟨K, C, R, S⟩, where K represents the number of kernels, R the height of the kernel, and S the width of the kernel.

2.2 Convolution Algorithms

Convolution is the key operation of CNNs. How to carry out these operations efﬁciently has become a hot research topic. Many algorithms have been proposed and most of them have different implementations in GPU environment.

Accelerating CNNs Using Optimized Scheduling Strategy

199

The formulation of the convolution operation is Eq. (1) [5], where N is the mini-batch size, C the number of channels, K the number of filters, R and S the filter size, P and Q the output size, U the stride size, F the filters of the CNN, and D the input maps. The traditional method of calculating convolution is based directly on Eq. (1); we call it the direct convolution algorithm. It performs the multiplications between elements and accumulates their results according to Eq. (1) [11]. It is the most straightforward way to perform convolution. Cuda-convnet2 [11] is a widely used direct convolution library.

$$O[n,k,p,q] = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} F[k,c,r,s] \cdot D[n,c,\, p \cdot U + r,\, q \cdot U + s], \quad (1)$$

$$n \in [0, N),\; k \in [0, K),\; p \in [0, P),\; q \in [0, Q).$$

Another algorithm is matrix multiplication (GEMM). It transforms the convolution operation into a matrix multiplication, as shown in Fig. 2 [5]. Because matrix multiplication has efficient computational libraries in the GPU environment, this simpler approach has gained considerable efficiency.
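The direct algorithm of Eq. (1) can be sketched in a few lines of NumPy (an illustrative reference implementation, not the GPU kernel of cuda-convnet2):

```python
import numpy as np

def direct_conv(D, F, U=1):
    """Direct convolution per Eq. (1).
    D: input maps, shape (N, C, H, W); F: filters, shape (K, C, R, S); U: stride."""
    N, C, H, W = D.shape
    K, _, R, S = F.shape
    P, Q = (H - R) // U + 1, (W - S) // U + 1   # output size
    O = np.zeros((N, K, P, Q), dtype=D.dtype)
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    # multiply-accumulate over all channels and the filter window
                    O[n, k, p, q] = np.sum(F[k] * D[n, :, p*U:p*U+R, q*U:q*U+S])
    return O

D = np.random.rand(1, 3, 8, 8)
F = np.random.rand(2, 3, 3, 3)
print(direct_conv(D, F).shape)  # (1, 2, 6, 6)
```

The four nested loops mirror the four output indices n, k, p, q of Eq. (1); the GPU libraries parallelize exactly these loops.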

Fig. 2. Transforming convolution into a matrix multiplication [2]. This process produces redundant data, marked in red in the figure.

One implementation of GEMM in the GPU environment is called explicit GEMM. It calculates the convolution directly according to the GEMM algorithm flow, but it has the disadvantage that the input matrix contains redundant data, which takes up extra memory space. Therefore, implicit GEMM [12] was proposed. It divides these matrices into small pieces and uses an index to guide the calculation. Small amounts of data can be loaded into on-chip memory directly without taking up extra GPU memory, but this method requires additional index calculation and sufficient bandwidth.
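The explicit-GEMM lowering can be sketched as follows (the helper name `im2col` is the conventional one, used here as an assumption). Note how the lowered matrix duplicates overlapping input pixels — exactly the redundancy that implicit GEMM avoids materializing:

```python
import numpy as np

def im2col(D, R, S, U=1):
    """Lower a (N, C, H, W) input to a (C*R*S, N*P*Q) matrix for explicit GEMM."""
    N, C, H, W = D.shape
    P, Q = (H - R) // U + 1, (W - S) // U + 1
    cols = np.empty((C * R * S, N * P * Q))
    col = 0
    for n in range(N):
        for p in range(P):
            for q in range(Q):
                # overlapping windows copy the same pixels into many columns
                cols[:, col] = D[n, :, p*U:p*U+R, q*U:q*U+S].ravel()
                col += 1
    return cols, (N, P, Q)

def gemm_conv(D, F, U=1):
    """Convolution as one big matrix multiply over the lowered input."""
    K = F.shape[0]
    cols, (N, P, Q) = im2col(D, F.shape[2], F.shape[3], U)
    out = F.reshape(K, -1) @ cols
    return out.reshape(K, N, P, Q).transpose(1, 0, 2, 3)

D = np.random.rand(2, 3, 8, 8)
F = np.random.rand(4, 3, 3, 3)
cols, _ = im2col(D, 3, 3)
# The lowered matrix is several times larger than the input: the redundant data.
print(cols.size / D.size)
```

For a 3x3 kernel with stride 1 the lowered matrix holds roughly R*S copies of each interior pixel, which is why explicit GEMM runs out of memory first in the experiments below.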

Fig. 3. Using FFT to calculate convolution. In the frequency domain, the small filter becomes the same size as the input image, which takes up extra space.


Another implementation of GEMM is implicit-precomp GEMM. It is based on implicit GEMM, but unlike implicit GEMM it does not require index calculation during the convolution itself: it obtains the index in advance by calculating the parameters of the CNN structure and the block size. This further speeds up the calculation, but it takes up some memory space to store the index. To speed up the operation further, the Fast Fourier Transform (FFT) is also implemented. It transforms the input and filter data into the frequency domain and computes their element-wise product [6]; the result is then transformed back into the time domain to obtain the final convolution result (see Fig. 3). FFT speeds up computation by reducing the computational complexity of convolution. The number of multiplications is O(P²R²) for the direct algorithm, whereas the FFT algorithm reduces it to O(P² log P) [2]. The disadvantage of the FFT algorithm is that it needs to store a large amount of intermediate data; the transformation also expands the filter to the size of the input maps [5]. For these reasons, the algorithm requires significant memory space, especially for small kernels and large inputs. To solve these problems, FFT-tiling was proposed, another implementation of the FFT algorithm on the GPU. Similar to implicit GEMM, it divides the input maps into small tiles. It uses blocked transmission and calculation to reduce memory usage and hide the latency of the transmission [12]. Another acceleration algorithm is the Winograd algorithm. It transforms multiplication operations into additions to reduce the computational cost [7], reducing the number of multiplications in convolution from O(P²R²) to O((P + R − 1)²). However, this algorithm lacks flexibility: when the filter size changes, the parameter matrices used for the transformation must be changed. In addition, the process generates intermediate data that needs to be stored [12].
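Both transform-domain ideas can be illustrated in one dimension (a sketch, not the batched GPU implementations): FFT convolution multiplies spectra element-wise, and Winograd's F(2, 3) computes two outputs of a 3-tap filter with 4 multiplications instead of 6:

```python
import numpy as np

def fft_conv1d(d, g):
    """Valid 1-D convolution (correlation form, as in CNNs) via FFT."""
    n = len(d)
    G = np.fft.rfft(g[::-1], n)        # zero-pad the filter to the input length
    Df = np.fft.rfft(d, n)
    full = np.fft.irfft(Df * G, n)     # element-wise product in frequency domain
    return full[len(g) - 1:]           # keep only the 'valid' outputs

def winograd_f23(d, g):
    """Winograd F(2, 3): two outputs of a 3-tap filter with 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.arange(4.0)                     # [0, 1, 2, 3]
g = np.array([1.0, 2.0, 3.0])
direct = np.array([np.dot(d[i:i + 3], g) for i in range(2)])
print(np.allclose(fft_conv1d(d, g), direct), np.allclose(winograd_f23(d, g), direct))
# True True
```

Note how the FFT version pads the 3-element filter to the full input length before transforming it — the very filter expansion that makes FFT memory-hungry for small kernels.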

2.3 Related Work

Since convolutional neural networks were introduced to the public, few studies have focused on comparing the convolution algorithms. At present, the best way to evaluate them is to refer to the experimental data provided by the algorithm developers. Mathieu et al. (2013) show the performance of the FFT algorithm [13, 19]. Chetlur et al. (2014) compare the implicit GEMM, explicit GEMM, and direct algorithms in their work [5]. Lavin et al. (2015) show the advantages of the Winograd algorithm over the GEMM and FFT algorithms [14]. However, over these years of development, the implementations of the algorithms in the GPU environment have diversified. For example, the GEMM algorithm has three implementations; although they execute the same algorithm, their performance is completely different. So it is necessary to conduct a comprehensive evaluation of these implementations. It should also be noted that many studies compare the performance of different DNN frameworks, e.g. [20, 21], but our work aims to show the characteristics of different convolution algorithms within the same framework. When the user selects


the appropriate framework, we give our optimization suggestions for reference. Meanwhile, based on our experimental results, we design an Optimized Algorithm Scheduling Strategy, through which we can improve the computational efficiency of CNNs.

3 Experimental Methodology

We conduct two experiments in our work. In the first experiment, we compare the characteristics of the different implementations. We measure the execution time and memory usage of these implementations to compare their characteristics. To identify the performance-limiting factors of each algorithm, we select a customized convolutional neural network as the workload, because it is representative and flexible enough to simulate many conditions. The default structure parameters of the custom network are as follows: N = 64 (mini-batch), C = 64 (channels), H = 56 (input size), R = 5 (kernel size), K = 128 (number of filters), U = 1 (stride size). The choice of these parameters references GoogLeNet [4, 8]. After that, we adjust the network parameters (N, H, R, K, U) using a variable-controlling approach: we change one of them and keep the others constant. In this way, we can observe how performance changes with that parameter.
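The variable-controlling sweep can be sketched as follows; the parameter names mirror the notation above, while the value grids are illustrative assumptions rather than the paper's exact settings:

```python
# Baseline configuration taken from the text (GoogLeNet-inspired defaults).
base = dict(N=64, C=64, H=56, R=5, K=128, U=1)

# Illustrative sweep ranges (assumed, not the paper's exact grids).
sweeps = {
    "N": [16, 32, 64, 128],
    "H": [20, 56, 112, 224],
    "R": [3, 5, 7, 11],
    "K": [32, 64, 128, 256],
    "U": [1, 2, 4],
}

def configs():
    """Vary one parameter at a time, keeping the others at the baseline."""
    for param, values in sweeps.items():
        for v in values:
            cfg = dict(base)
            cfg[param] = v
            yield param, cfg

for param, cfg in configs():
    pass  # run the benchmark for cfg and record runtime / memory here

print(sum(1 for _ in configs()))  # 19 configurations in this sketch
```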

Table 1. System configuration

CPU: Intel Core i7-6700K (4.00 GHz)
GPU: NVIDIA GeForce GTX 1080
Main memory: 8 GB
GPU memory: 8 GB
Operating system: Ubuntu 16.04 LTS
Framework: Caffe v1.0.0
Libraries: CUDA 8.0; cuDNN 6.0; cuda-convnet2

In the second experiment, we measure the execution speed of AlexNet in the GPU environment using different algorithm scheduling strategies. Based on our previous experiments, we suggest possible optimization techniques to improve the speed of CNNs. We also design an optimized algorithm scheduling strategy, which assigns the best-suited implementation to each of AlexNet's convolutional layers. We verify our strategy by comparing it with the Caffe+CUDA and cuDNN strategies. Our experiments are performed on the system described in Table 1, which also lists the versions of the deep learning framework and libraries used. We use Caffe to build our CNN models. Since cuDNN does not support direct convolution, we implement the direct convolution algorithm with cuda-convnet2.


4 Comparison of Algorithms

The characteristics of an algorithm are reflected by its execution efficiency under different conditions. In this section, we characterize the seven implementations of convolution algorithms in the GPU environment (implicit GEMM, implicit-precomp GEMM, explicit GEMM, FFT, FFT-tiling, Winograd, and direct). We measure the runtime and memory usage to compare the performance of the seven implementations with respect to different input sizes, kernel sizes, and stride sizes. In this way, we show the influence of the network structure on the performance of the implementations. For convenience, we denote implicit GEMM as GEMM1, implicit-precomp GEMM as GEMM2, explicit GEMM as GEMM3, traditional FFT as FFT1, and FFT-tiling as FFT2.

[Figure: two panels plotting memory (MB) and run-time (ms) against input size for GEMM1, GEMM2, GEMM3, FFT1, FFT2, Winograd, and direct.]

Fig. 4. The impact of the input size on performance

4.1 Input Size

Figure 4 shows the performance of all algorithms with different input sizes. For large input sizes (> 20), the performance advantage of FFT2 becomes more and more obvious. The Winograd algorithm has a runtime similar to FFT2, but it runs out of memory when the input size reaches 160. The runtime of FFT1 fluctuates when the input size is around 64. The reason is that, for different input sizes, FFT1 calls different functions or libraries to calculate the Fourier transform, and one of the thresholds is 64; as a result, FFT1 gives the worst performance in our experiment when the input size is 80. GEMM1 and GEMM2 still consume the least memory. GEMM3, FFT1, and the Winograd algorithm run out of memory when the input size reaches 100, 140, and 160, respectively. Interestingly, the memory usage of FFT2 is less than that of the direct algorithm when the input size is greater than 80; the main reason is that these two algorithms use different acceleration libraries.

[Figure: two panels plotting runtime (ms) and memory (MB) against kernel size for GEMM1, GEMM2, GEMM3, FFT1, FFT2, Winograd, and direct.]

Fig. 5. The impact of the kernel size on performance

4.2 Kernel Size

Figure 5 shows the performance of all algorithms with different kernel sizes. Note that the Winograd algorithm only supports 3×3 and 5×5 kernels, so its runtime is reported with two data points in Fig. 5. In addition, the direct algorithm cannot support all given filter numbers in our experiment. For small kernel sizes (< 5), GEMM2 is faster than FFT2, and the Winograd algorithm has a runtime similar to GEMM2. But when the kernel size is greater than 5, FFT2 gives the best performance, and FFT1 is a bit slower than FFT2. Moreover, their runtime tends toward a constant value while the kernel size stays below 32. The reason is that the FFT algorithm must Fourier-transform the kernel, whose size is padded to match the input size, so the kernel size has essentially no effect on the FFT computation. Since FFT2 divides the input into 32×32 tiles [12], it fails when the kernel size reaches 32. Interestingly, the runtime curves of GEMM1 and GEMM2 are arched. Calculating the number of multiplications of GEMM (O(P²R²), with P = (H − R)/U + 1) yields a quadratic-shaped function, matching the trend of the GEMM runtime in Fig. 5. In memory usage, GEMM3 has the highest consumption, and it even runs out of memory when the kernel size reaches 9. The memory usage of the other algorithms is not affected by the kernel size, so their memory consumption is basically unchanged.
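The arched trend can be reproduced from the multiplication count alone. With P = (H − R)/U + 1 and the defaults H = 56, U = 1 from Sect. 3, the per-(n, k, c) cost P²R² is a quadratic-shaped function of R (a back-of-the-envelope sketch):

```python
H, U = 56, 1

def mults(R, H=H, U=U):
    """Multiplications per (n, k, c) triple: P^2 * R^2, with P = (H - R)//U + 1."""
    P = (H - R) // U + 1
    return P * P * R * R

for R in (3, 9, 17, 29, 41, 53):
    print(R, mults(R))
# The count rises, peaks near R = H/2, then falls: the arched curve of Fig. 5.
```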

[Figure: two panels plotting run-time (ms) and memory (MB) against stride for GEMM1, GEMM2, GEMM3, FFT1, FFT2, Winograd, and direct.]

Fig. 6. The impact of stride on performance

4.3 Stride Size

Figure 6 shows the performance of all algorithms with different stride sizes. Only the GEMM algorithm passes all tests, because the FFT and Winograd algorithms only support a stride of 1, and the direct algorithm has an upper bound on the stride size. When the stride is greater than 1, GEMM2 gives the best performance. The runtime and memory-consumption curves of GEMM are hyperbolic: the stride determines the amount of data that must be processed, and the larger the stride, the less data there is. In conclusion, FFT2 is the fastest implementation for training a CNN model with large kernel sizes (larger than 5 in our experiment) and large input sizes (larger than 20 in our experiment), due to its low arithmetic complexity and blocked operation. The Winograd algorithm performs similarly, but considering memory usage we prefer FFT2. FFT1 is a bit slower than FFT2 when computing convolution with a large input size, but for a small input size (smaller than 20), FFT2 is slower than FFT1. For small kernel and input sizes, the Winograd algorithm and GEMM2 are good choices. However, GEMM2 is more flexible than the Winograd algorithm or FFT2, because the Winograd algorithm only supports 3×3 and 5×5 kernels, and FFT2 and the Winograd algorithm only support a stride of 1. Moreover, GEMM2 always occupies the least memory, so it is well suited to cases where memory is limited.

5 Optimized Scheduling Strategy

As mentioned earlier, for the same neural network structure, different implementations of the convolution algorithms often perform differently. The diversity of a deep CNN's layer sizes and the differing performance of the implementations therefore demand an efficient scheduling strategy that assigns the appropriate implementation to each convolutional layer. In this way, we can optimize both power efficiency and performance. By analyzing our experimental data, we propose an optimized algorithm scheduling strategy. The strategy selects the algorithm according to the structure parameters of the current convolutional layer. For each layer, the strategy reads the model parameter file to obtain the input map data structure ⟨N, C, H, W⟩, the weight data structure ⟨K, C, R, S⟩, and the stride size U. The strategy then examines the parameters U, H, and R in turn. The basic flow is as follows (see Fig. 7): when U is greater than 1, our strategy assigns implicit-precomp GEMM as the convolution implementation for the current layer; according to our experimental results, this implementation works best in this case. If U equals 1, we examine the value of H. If H is greater than 16, our strategy assigns FFT-tiling for the current layer; according to our characteristic analysis of FFT-tiling, this implementation outperforms the others when H > 16. If H is less than or equal to 16, our experiments show that FFT-tiling is not the best choice, and our strategy re-selects an implementation according to the value of R.


Fig. 7. The workflow and pseudocode of the Optimized Algorithm Scheduling Strategy
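The decision flow of Fig. 7 can be sketched as a plain function of the layer parameters. The thresholds are those stated in the text, while the AlexNet layer shapes below follow the standard configuration and are used here only for illustration:

```python
def choose_algorithm(U, H, R):
    """Optimized Algorithm Scheduling Strategy for one convolutional layer.
    U: stride size, H: input size, R: kernel size."""
    if U > 1:
        return "implicit-precomp GEMM"   # best whenever the stride exceeds 1
    if H > 16:
        return "FFT-tiling"              # large inputs favor tiled FFT
    if R in (3, 5):
        return "Winograd"                # the only kernel sizes Winograd supports
    return "FFT"

# AlexNet's convolutional layers as (stride, input size, kernel size);
# the input sizes are the standard AlexNet feature-map sizes (illustrative).
alexnet = [(4, 224, 11), (1, 27, 5), (1, 13, 3), (1, 13, 3), (1, 13, 3)]
for i, (U, H, R) in enumerate(alexnet, 1):
    print(f"conv{i}: {choose_algorithm(U, H, R)}")
```

With these shapes the function reproduces the per-layer choices our strategy makes for AlexNet.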

When R is equal to 3 or 5, our strategy assigns Winograd as the implementation for the current layer. If R is not equal to 3 or 5, according to our experimental results, the FFT implementation is the best choice. With this strategy, we obtain the implementation with the best performance for each convolutional layer in the GPU environment. By reducing the execution time, we also reduce energy consumption. The workflow also shows that our strategy involves only the structural parameters of the network and does not depend on the operational data or the execution process of the CNN. This means we can execute our strategy in advance, so the operation of the CNN is not affected. To verify our strategy, we compare it with the two scheduling approaches of Caffe+CUDA and Caffe+cuDNN. The Caffe+CUDA solution relies on the GEMM algorithm and uses the CUDA library to accelerate the convolution operation in the GPU environment; there is no algorithm scheduling in this solution. The Caffe+cuDNN solution uses multiple algorithms to accelerate CNNs. It uses cudnnGetConvolutionAlgorithm(), which serves as a heuristic for selecting the suitable cuDNN convolution algorithm for the given layer specification. In our experiments, we use these three strategies to accelerate the AlexNet network and measure their execution times; the results are shown in Fig. 8. From the experimental data, it can be seen that Caffe+CUDA is the slowest, because it only uses the GEMM algorithm, which is inefficient for executing AlexNet's varied layer structures. In contrast, Caffe+cuDNN has a variety of convolution algorithms and chooses the appropriate algorithm or implementation to accelerate each convolutional layer of AlexNet, further improving the computational efficiency. In our experiments, Caffe+cuDNN increases the speed by 2.3x over Caffe+CUDA.
According to the data structure parameters of the neural networks, our solution arranges the most suitable


Table 2. Convolution algorithms arranged by the different strategies for each convolutional layer of AlexNet in the GPU environment.

                conv1        conv2        conv3        conv4       conv5
Config          Stride = 4   Stride = 1   Stride = 1   Stride = 1  Stride = 1
Caffe+CUDA      GEMM         GEMM         GEMM         GEMM        GEMM
Caffe+cuDNN     GEMM(a)      GEMM(a)      Winograd     Winograd    Winograd
Our strategy    GEMM(a)      FFT-tiling   Winograd(b)  Winograd    Winograd

(a) GEMM: implicit-precomp GEMM. (b) Winograd: another implementation of the Winograd algorithm.

convolution algorithm for each convolutional layer of AlexNet, so as to achieve the maximum acceleration effect. Our strategy increases the speed by 2.8x over Caffe+CUDA and is 1.2x faster than Caffe+cuDNN. To further explore the differences between these strategies, we record the algorithms each of them chooses for each layer of AlexNet (Table 2). In conv1, conv4, and conv5, our strategy and Caffe+cuDNN choose the same implementation to speed up the convolution operations. But in conv2, Caffe+cuDNN chooses implicit-precomp GEMM, whereas our strategy chooses FFT-tiling; our strategy thereby increases the speed of the convolution operations by 40% over Caffe+cuDNN. Similarly, in conv3, our strategy chooses another implementation of the Winograd algorithm, making it 10% faster than Caffe+cuDNN.

[Figure: runtimes of Caffe+CUDA (1.0x), Caffe+cuDNN (2.3x), and our strategy (2.8x).]

Fig. 8. The execution time of AlexNet using different strategies in the GPU environment

Our experiments show that our strategy outperforms both Caffe+CUDA and Caffe+cuDNN. Note that our strategy only needs to read the structural parameters of the current layer of the CNN; it is independent of the data actually participating in the calculation. We can therefore apply the strategy in advance to organize the computation of each convolutional layer, so the scheduling itself adds nothing to the actual execution time of the CNN.


6 Conclusion

Convolutional neural networks have become a hot topic in current research. Our work compares the performance of popular convolution algorithms in the GPU environment within the same framework. Based on our experiments, we find that choosing convolution algorithms carefully can make a CNN model execute its convolutional layers faster. For this reason, we propose an optimized algorithm scheduling strategy that assigns the best implementation to each convolutional layer. This strategy is simple and does not interfere with the execution of the CNN. Experiments show that our strategy speeds up the execution of the CNN model by 1.2x to 2.8x.

References

1. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
2. Sze, V., et al.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(2), 84 (2012)
4. Simard, P., Lecun, Y., Denker, J.S.: Efficient pattern recognition using a new transformation distance. In: Advances in Neural Information Processing Systems (NIPS 1992), pp. 50–58 (1992)
5. Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. Computer Science (2014)
6. Mathieu, M., Henaff, M., Lecun, Y.: Fast training of convolutional networks through FFTs. Eprint Arxiv (2013)
7. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks, pp. 4013–4021. Computer Science (2015)
8. Cheng, J., Grossman, M., Mckercher, T.: Professional CUDA C Programming. Wiley, New York (2014)
9. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding, pp. 675–678 (2014)
10. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
11. Krizhevsky, A.: cuda-convnet2 (2014). https://github.com/akrizhevsky/cuda-convnet2/
12. NVIDIA: cuDNN User Guide (2017). https://developer.nvidia.com
13. Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. Statistics (2015)
14. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a Matlab-like environment for machine learning. In: BigLearn, NIPS Workshop (2011)
15. Szegedy, C., et al.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition, pp. 1–9. IEEE (2015)
16. Lecun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Computer Science (2014)
18. He, K., et al.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)


19. Vasilache, N., Johnson, J., Mathieu, M., et al.: Fast convolutional nets with fbfft: a GPU performance evaluation (2014)
20. Li, X., et al.: Performance analysis of GPU-based convolutional neural networks. In: International Conference on Parallel Processing, pp. 67–76. IEEE (2016)
21. Kim, H., et al.: Performance analysis of CNN frameworks for GPUs. In: IEEE International Symposium on Performance Analysis of Systems and Software, pp. 55–64. IEEE (2017)

Data Analysis of Blended Learning in Python Programming

Qian Chu(1,2), Xiaomei Yu(1,2)(B), Yuli Jiang(1,2), and Hong Wang(1,2)

1 Institute of Information and Engineer, Shandong Normal University, Jinan, China
[email protected]
2 Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan 250014, Shandong, China

Abstract. The rapid emergence of blended learning has sparked a great deal of research interest in the field of educational data mining. We apply the novel educational form of blended learning in the undergraduate curriculum of Python programming. A questionnaire administered before the course captures the basic information of the undergraduate students, and we design educational resources and activities for online study and face-to-face teaching. Since the learning process of each student is captured continuously, we make weekly teaching and learning evaluations to improve the current teaching methods and thus arouse students' interest in continuous learning. By analyzing the data and mining the knowledge obtained in the process of blended learning, we gain beneficial results that promote the quality of blended learning in the undergraduate curriculum of Python programming and benefit the undergraduate students as well as higher education in the long run.

Keywords: Blended learning · Education · Python · Data analysis

1 Introduction

Blended learning is an effective teaching approach in which learning occurs both online and face-to-face, with the purpose of capturing the educational strengths of both the internet and the classroom. This novel teaching form has opened a new era of education by moving the so-called "force-feeding method of teaching" to a blended form combining online video learning and the flipped classroom. On the one hand, offloading lecture time to video makes it possible for teachers to spend more time interacting with students in class. On the other hand, the flipped classroom actually enhances oversight and encourages students to take part in class activities. Therefore, we introduce blended learning into the undergraduate curriculum of Python programming.

Supported by the National Nature Science Foundation of China (No. 61672329, No. 61773246) and Shandong Normal University's Educational Project for Blended Learning (No. 2016KG79, No. 2016JG54).

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 209–217, 2018. https://doi.org/10.1007/978-3-030-05057-3_16

210

Q. Chu et al.

In fact, most first-year students in the university are required to take a semester course on computer science, a large portion of which is based on their interests and background in computer science. The purpose of the course is twofold: (1) to introduce the programming language Python to freshmen and show them common data analysis and machine learning libraries in Python as well as their uses; (2) to teach them the basics of programming and introduce the engineering problem-solving methodology. Considering that the students enrolled come from diverse majors and differ sharply in their motive intensity and study attitude, we carry out a simple questionnaire before the course to capture the basic information of the undergraduate students. We then design educational resources and activities for online study and face-to-face teaching, as well as for selective and personalized study. Since the learning process of each student is captured continuously, we make weekly teaching and learning evaluations to improve the current teaching methods and thus arouse students' interest in continuous learning. Finally, data preprocessing such as pretreatment and standardization is performed on the data collected in the process of blended learning, with the purpose of promoting the quality of blended learning in the undergraduate curriculum of Python programming. The main contributions of this paper are outlined as follows:

– The questionnaires before and after the course are designed to capture the basic information of the undergraduate students and further improve the blended learning in the following weeks.
– Creative work, including personalized teaching videos and educational resources, is produced to benefit the whole blended learning process.
– By analyzing the data and mining the knowledge obtained in the process of blended learning, we gain beneficial results that promote the quality of blended learning in the next semester.

The remainder of this paper is organized as follows. In Sect. 2, we review related work on blended learning. After addressing some relevant problems, we present our blended learning process in Sect. 3. Based on analyzing data and mining knowledge in blended learning, Sect. 4 describes the methods of teaching and learning evaluations. Finally, conclusions are outlined and future work is presented in Sect. 5.
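The standardization step mentioned above is ordinary z-scoring; a minimal sketch with hypothetical weekly evaluation scores:

```python
import numpy as np

def standardize(scores):
    """Z-score standardization of raw learning-process data."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

weekly = [72, 85, 90, 64, 78]   # hypothetical weekly evaluation scores
z = standardize(weekly)
print(np.round(z, 2))            # zero mean, unit variance
```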

2 Related Work

Blended learning is a learning approach that retains the values of traditional education while incorporating advanced educational technologies. In 2008, Garrison and Vaughan introduced three cases of blended learning. In a small politics class, blended learning enabled students to gain more meaningful experience; in a large chemistry class, it was used to increase teacher-student interaction and improve problem-solving skills [1]. Graham et al. put forward a framework for the adoption and implementation of blended learning based on an investigation of six American universities. The stages are divided into three categories:


awareness/exploration, adoption/early implementation, and mature implementation/growth [2]. In general, blended teaching can be understood in this way: the students acquire knowledge not only in the classroom but also through online courses in extracurricular time, so they have the freedom to control the pace of learning themselves. With the two study phases connected as a whole, the students' individual learning needs are met. In 2003, Professor He first proposed the concept of blended learning in China. Combining the advantages of traditional learning and e-learning, blended learning further emphasizes the role of the teachers' leading guidance as well as the dominant position of the students' studying abilities. The theory of blended learning has caused a revolution in the education sector [3]. In 2006, Huang et al. addressed the use of appropriate technology and means to impart knowledge to students, so as to optimize the effect of learning [4]. Professor Li Kedong believes that blended learning can be applied on an appropriate teaching platform and media, so as to benefit the effect of learning [5]. The theory of blended learning has been continuously improving and developing.

3 Practice of Blended Learning in Python Programming

In order to follow the idea of "Internet + education" and improve the quality of teaching, we choose Python programming as the practical subject of an elective course at our university, and introduce blended learning to explore its effect in practical courses.

Analysis Before Class

Before implementing blended learning, we designed a questionnaire to fully understand the students' basic information, such as learning ability and cognition. 98 copies were issued and all of them were returned, an effective rate of 100%.

Basic Information About Students. The subjects are four-year undergraduate students from different grades and schools, and the Python programming course is organized as a mixed class. The students differ in cognitive style, computer knowledge and learning habits; that is to say, their basic information should be used both to teach them according to their aptitude and to arrange the teaching content reasonably. The students in the class come from 12 different colleges of the university, including the School of Information Science and Engineering, the School of Music, the School of Physics and Electronic Sciences, etc. The analysis of the students' basic information is shown in Table 1. As the proportion of boys and girls is close to 1:1, the teachers have to take the differences in thinking between boys and girls into account. In fact, the boys have some advantages in practical abilities, while the girls have more advantages in careful thinking, which makes it possible for the teachers to strengthen unity and cooperation among students, so as to promote complementary advantages. Based on this situation, the curriculum for Python programming is designed as follows: the content emphasizes basic theoretical knowledge and important technical practice; the face-to-face teaching adopts case-driven teaching and group collaboration; the online teaching adopts task learning lists driven by a combination of learning forms, including online videos, knowledge-point tests, projects and so on.

Table 1. Basic information about students (counts by school type).

Grade      Gender   Computer  Management  Others
Freshman   Male     13        3           12
Freshman   Female   2         15          27
Sophomore  Male     13        0           3
Sophomore  Female   0         0           9

Students' Level of Basic Knowledge. There are differences in the students' foundation of computer knowledge. For the question "Have you ever studied a programming course?", the proportion of those who have studied programming to those who have never studied it is close to 1:1, which means that nearly half of the students have no knowledge, or only a very weak foundation, of programming. For the question "What do you think of your current programming ability?", the results are shown in Fig. 1; from level 1 to level 5, the self-rated ability decreases. Students at the weakest levels account for about 50% of the total, and those with poor programming ability for about 33%; in all, nearly 90% of the students have weak programming skills. They may consider the course too hard to understand. Therefore, teachers should explain that the purpose of the course is to teach basic application, so as to enhance the students' self-confidence in studying it.

Students' Attitude to Blended Learning. Blended learning differs from traditional teaching, so prior to class teachers should be aware of students' attitudes towards it. For the question "Which teaching method do you prefer?", the result shows that more than half of the students tend towards the blended learning model. What's more, for the question "Do you support learning the course in a blended learning fashion?", 82% of the students support the method of blended learning, and only a few students are hesitant or opposed, indicating that most students have a positive attitude towards mixed-type courses, which is important for the process of blended teaching.


Fig. 1. Students' feedback on their programming ability

From the above it can be seen that the students differ greatly in their basic programming abilities, but they have strong motivation to learn and most are willing to try this new approach in the curriculum.

3.2 Resources and Activities Design

Selecting an appropriate learning platform helps teachers better control the entire teaching process. According to the studies of Xu [6], the online platform for blended learning is increasingly being transformed from formal platforms to informal ones, where informal platforms are more personalized according to the actual teaching situation. For the Python programming course, we select Superstar online as our blended learning platform.

Teaching resource design is essential to enrich the learning resources on the online platform. Teachers add modules and upload learning resources to the Superstar learning platform, which meets their individual teaching needs, and micro-lesson videos are recorded. Compared with the existing open classes or excellent courses on the Internet, highly efficient micro-classes of 8 to 10 minutes are more likely to make a deep impression on students. Moreover, homemade micro-classes can satisfy the needs of students more accurately, and students can flexibly adjust the learning content by themselves. The online Python programming resources include a general introduction to Python, basic Python operations, Python data structures, data reading and analysis, project practice, etc. Each weekly course is equipped with an autonomous learning task sheet to give guidance in learning. Other modules, such as the homework module and the Q&A module, are set up as needed.

In blended learning, information technology is regarded as a means of teaching, and teachers will not be replaced by technical means [7]. In teaching activities, the learning activities for students mainly include collaborative learning, self-learning, physical learning, practical learning and so on. Interaction among students and between students and teachers is also included. Although more informal, such interaction is indispensable, for online forums play an important role in necessary communication. Moreover, the teachers obtain a large amount of information from the forums, for example, students' understanding of and opinions about the mixed course. In this study, the forum provides online communication between teachers and students, as well as among students. Teachers improve the management and resources on the platform, monitoring and regulating the learning activities of the class. After each class, the teachers design a micro-questionnaire to track students' learning and provide suggestions for the next class. The micro-questionnaire takes two or three questions as the standard, in order to capture the learning effect and the possible problems in teaching.

3.3 Blended Learning Process

With the development of the Internet and educational technologies, blended learning is better able to meet students' study needs. Before class, the teachers upload the necessary recorded videos to the teaching platform, so students can control the learning process by themselves; when difficult points are encountered, they can watch the videos repeatedly. In face-to-face teaching, the teachers focus on the key points of learning. For problems that arise in practice, the students communicate with their classmates or ask the teachers in class time. In this way, the students' hands-on abilities are strengthened: compared with traditional teaching, passive acceptance of the rigid knowledge in the book is replaced by the acquisition of practical skills and abilities.

For each session of Python programming, students study autonomously on the teaching platform, referring to the autonomous learning quiz, watching instructional videos, etc. For questionable points of knowledge, they discuss with classmates or teachers in the discussion forum. Generally speaking, a group consists of five or six students, and the members of a group work together; moreover, different groups communicate with each other. The form of group learning enables everyone to participate in the learning activities, which greatly stimulates students' enthusiasm for learning. The teachers guide the students to focus on memorizing knowledge points and resolve the students' doubts, using a task-driven approach that lets students link theory and practice by "doing". As a result, a project or some small exercises are selected to encourage students to demonstrate their learning results in the classroom, and classroom time is left for them to demonstrate their experiments and their experience in using Python if necessary.

4 The Analysis of Evaluation Results

In modern education, the public pays more attention to the performance of students in practice and prefers multiple forms of evaluation. Therefore, the goal of teaching has changed to "promoting the all-round development of students".

4.1 Subjective Teaching Effectiveness Survey

After the course, questionnaires and interviews are used to gain insight into students' thinking about blended learning. From the students' evaluation of the course, it can be seen that the students' learning abilities improved, and outstanding scores were achieved by students participating in competitions. Moreover, blended learning brings advantages in learning habits for both teachers and students.

With the novel form of blended learning, the cooperative ability of students is improved: about 89% of the students think that their abilities have improved. The teachers then communicated with the remaining students to find out the possible reasons affecting the improvement of their abilities; these factors are the key points for improving blended learning. As to whether or not they support blended learning, 49% of the students chose strong support and 47% just support it, while only 4% oppose it. In the pre-course questionnaire survey, by comparison, 35% of the students strongly supported it, 45% just supported it, and 20% opposed it. It can be seen that the overall number of students who support it rose markedly and the number who oppose it decreased remarkably. For the question "With blended learning, what do you benefit from most in the course?", 40% of the students gained knowledge about Python and improved their programming ability, 22% increased their ability to collaborate, 24% increased their ability to discover and solve problems, and 14% became more interested in learning. In all, these factors promote the improvement of the students' overall abilities. For the question "Do the blended learning activities boost your learning in the Python class?", 53% of the students think it very helpful and 38% consider it helpful, which shows that blended teaching helps students learn better. The results are shown in Fig. 2.

Fig. 2. The analysis of evaluation results

"Can you finish the learning tasks easily and on time?" As to this question, about 30% of the students delayed completing the tasks or could not complete them on time, which indicates that the learning efficiency of these students is not high. On the main factors affecting task completion, 31% of the students think the tasks themselves are inconvenient to complete, 17% find it inconvenient to use the computer network, and 30% think that they lack good learning habits. In order to enhance the students' sense of responsibility in the course, the teachers should adjust the learning tasks to a reasonable level and make more humanized arrangements to ensure that the students have enough time to complete them. In fact, about 54% of the students believe that blended learning puts forward higher requirements for teaching and learning, which requires cooperation between teachers and students to coordinate learning activities and improve their adaptability to blended learning.

4.2 Objective Teaching Effect Survey

After a semester of blended learning activities, the teaching platform recorded the students' learning activities in detail, yielding data such as the number of visits, students' scores, video watching durations, the number of discussions, the chapter tests and so on. The obtained data are preprocessed and standardized to extract valuable information on blended learning. As for visiting time, 73.44% of the students visited the learning platform between 16:00 and 24:00. Thus, most of the students studied after completing their daytime courses, using free time in the evening. They visit the learning platform in the dormitory or study room to review or preview the course, which means the students' online learning is achieved mainly on computers. Therefore, the next improvement is to extend blended learning to the more flexible fashion of mobile terminals.
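As a small illustration of this kind of platform-log analysis, the share of evening visits can be computed as below. The records, field names and timestamps are invented for the sketch; the study's actual platform data are not reproduced here.

```python
from datetime import datetime

# Hypothetical access-log records exported from the teaching platform.
logs = [
    {"student": "s01", "time": "2018-10-15 17:20:00"},
    {"student": "s02", "time": "2018-10-15 21:05:00"},
    {"student": "s03", "time": "2018-10-16 09:40:00"},
    {"student": "s01", "time": "2018-10-16 22:15:00"},
]

def evening_share(records, start_hour=16):
    """Fraction of visits made between start_hour:00 and 24:00."""
    hours = [datetime.strptime(r["time"], "%Y-%m-%d %H:%M:%S").hour
             for r in records]
    evening = sum(1 for h in hours if h >= start_hour)
    return evening / len(hours)

print(evening_share(logs))  # 3 of the 4 sample visits fall in the evening -> 0.75
```

Run over a full semester of logs, the same computation yields figures such as the 73.44% evening-access share reported above.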

5 Summary

In this study, a semester's teaching with the novel form of blended learning is introduced and improvements to blended learning are made, which enrich the teaching methods and achieve better-than-expected teaching effects in the undergraduate curriculum of Python programming. However, the research and implementation of blended learning are not yet mature, and there is still room for improvement in our following study.

References

1. Garrison, D.R., Vaughan, N.D.: Blended Learning in Higher Education: Framework, Principles, and Guidelines. Wiley, New York (2008)
2. Graham, C.R., Woodfield, W., Harrison, J.B.: A framework for institutional adoption and implementation of blended learning in higher education. Internet High. Educ. 18, 4–14 (2013)
3. Kukang, H.: From blending learning to see the new development of educational technology theory (Part Two). China Electrochem. Educ. 3, 5–10 (2004)
4. Ronghuai, H.: The Theory and Practice of Blended Learning, pp. 33–35. Higher Education Press, Beijing (2006)
5. Li, K., Zhao, J.: The principle and application of hybrid learning. Electr. Educ. Res. 2(7), 1–6 (2004)
6. Meidan, X.: Study and Design of Mixed Learning Based on WeChat Public Platform. Normal University, Nanjing (2016)
7. Cheng, G., Chau, J.: Exploring the relationships between learning styles, online participation, learning achievement and course satisfaction: an empirical study of a blended learning course. Br. J. Educ. Technol. 47(2), 257–278 (2016)

APs Deployment Optimization for Indoor Fingerprint Positioning with Adaptive Particle Swarm Algorithm

Jianhui Zhao(1), Jun Li(1), Haojun Ai(1,2), and Bo Cai(1)

(1) School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
[email protected]
(2) Collaborative Innovation Center of Geospatial Technology, Wuhan 430079, Hubei, China

Abstract. Indoor positioning services give people much better convenience, but their efficiency is affected by the spatial deployment of access points (APs). We propose an adaptive particle swarm (APS) algorithm and apply it to APs deployment optimization for fingerprint-based indoor positioning. In our method, solutions of APs placement are taken as individuals of one population. The particle swarm method is improved with an adaptive technique to ensure population diversity while avoiding a large number of inferior particles. After evolution, the optimal result is obtained, corresponding to the best solution of APs deployment. The algorithm works well for both single-objective and multi-objective optimization. Experiments with deployments of 107 iBeacons have been performed in an underground parking lot. Compared with existing APs placement methods, our APS algorithm obtains the least indoor positioning error with a fixed number of APs, and receives the best integrated evaluation considering both positioning error and APs cost with an unfixed number of APs. The proposed algorithm can easily be generalized to other kinds of indoor spaces and different types of signal sources.

Keywords: Indoor positioning · APs deployment · Optimization algorithm · Adaptive particle swarm

1 Introduction

Positioning technology can be used in outdoor and indoor environments. Outdoor positioning technology includes GPS, Galileo, the Beidou navigation satellite system, etc. However, satellite signals attenuate seriously when penetrating buildings, and the complex indoor environment causes further attenuation, so it is impossible to achieve indoor positioning with satellite signals. Currently, there is an increasing demand for indoor positioning, e.g., locating a person in a building, on a floor, or even in a room; finding certain goods in a warehouse; locating a wallet or key in an office; and so on. Indoor positioning can greatly facilitate people's work and life, and it has received increasing attention from users and researchers [1, 2].

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 218–228, 2018. https://doi.org/10.1007/978-3-030-05057-3_17


The indoor positioning system may use different types of signal sources [3], such as WiFi, Bluetooth, UWB, LED, ultrasonic, RFID, infrared, ZigBee, etc. According to the positioning algorithm, indoor positioning mainly includes cell-ID positioning, triangle positioning, multilateral positioning and fingerprint positioning methods [4]. Cell-ID positioning and triangle positioning cannot guarantee the accuracy of positioning. Multilateral positioning has theoretically high accuracy, but it is difficult to obtain time and angle parameters from ordinary equipment. Fingerprint-based positioning collects signal characteristic parameters at different positions in the space and establishes a fingerprint database; the target position is then determined by comparing the received signals with the signal characteristic parameters in the database. Therefore, the fingerprinting method is usually used for indoor positioning. Ma et al. [5] presented an indoor positioning system based on an intelligent fingerprint-assisted method to calculate reliable reference point data and develop a learning mechanism through the wireless network connection. Xia et al. [6] developed a method of processing the off-line data and an improved KNN positioning method to improve fingerprint-based positioning precision. Raspopoulos [7] studied the use of deterministic channel modeling through 3D ray tracing to construct device-independent radio maps for a WiFi RSSI-based fingerprinting indoor positioning system, which is applicable to different devices.

Deploying the signal sources (access points, APs) properly in advance is very important for indoor positioning. There are two types of APs deployment approaches: non-optimal and optimization methods, where the non-optimal method mainly refers to uniform deployment. Traditional APs deployment usually adopts the uniform method, i.e., signal sources are evenly distributed in the space. But there may be less accurate positioning results with too few APs, or unnecessary waste with too many APs. In order to achieve both precise indoor positioning and low cost, APs deployment should be optimized. The Maximum and Minimum Coverage (MMC) method proposed by Dhillon and Chakrabarty [8] uses polynomial-time algorithms to determine the number of sensors and their placement, helping to address coverage optimization under constraints of imprecise detection and terrain properties. Based on the Cramer-Rao Lower Bound (CRLB) and Simulated Annealing (SA), Zhou et al. [9] presented an optimization method for APs placement, which focuses on the error bound analysis of indoor WiFi fingerprint-based positioning for intelligent APs placement, using the Fisher Information Matrix (FIM) to characterize the relationship between positioning errors and signal distributions. Besides, there are other kinds of complex optimization methods [10, 11], e.g., the genetic algorithm, the artificial immune algorithm, particle swarm optimization, etc., and they can also be used for APs deployment optimization in indoor positioning.

Particle Swarm Optimization (PSO) has been widely utilized in the fields of neural network training, function optimization and fuzzy system control. For optimization, PSO has good search depth, but its search breadth is insufficient [12]. Therefore, an adaptive particle swarm algorithm, APS, is proposed to optimize APs deployment for indoor fingerprint positioning. The APS can improve the breadth-searching ability of traditional PSO and can generate better global optimization. Compared with existing optimization algorithms, our proposed method can obtain a more optimal indoor APs placement, in both single-objective (e.g., indoor positioning error) and multi-objective (e.g., positioning error and APs cost) evolutions.

2 Our Algorithm for Indoor APs Deployment

In our work, the fingerprint-based positioning method is used for indoor location, while positioning error and APs cost are the main considerations in defining the objective function. The adaptive PSO, APS, is implemented for the spatial optimization of APs.

2.1 Objective Function and Fingerprint Positioning

To test the efficiency of different optimization algorithms for APs deployment, APs are initially placed on the reference points in the indoor space. Then some of the APs are chosen as one possible solution of deployment optimization, and their locations are estimated with the fingerprint positioning method. The differences between reference points and estimated points are calculated and taken as the positioning error of the selected APs. Considering the positioning error, or both the positioning error and the cost of APs, the objective function value can be computed and then used to evaluate the spatial deployment of APs. With the help of the optimization algorithm, the optimal APs deployment is obtained after iterations of searching. Given the indoor space and APs parameters, the optimization algorithm can provide installation suggestions for signal sources.

Suppose there are n APs in the indoor place, and m of them are selected as one deployment. The chosen APs are evaluated as follows.

1. Place all the n APs evenly in the building for indoor positioning, i.e., their row spacing and column spacing are both k meters, and record the coordinates of all the APs as reference points in a database.
2. For the selected m APs, retrieve their coordinates, i.e., the corresponding m reference points, from the database.
3. Receive the signal values from the APs at each reference point, and record them by mobile phone as fingerprints in a 2D array, i.e., the entry in the jth column and the ith row is the signal value of the jth AP collected at the ith reference point.
4. Estimate the coordinates of each of the selected m APs, and record them as the estimated points.
5. As one solution of APs spatial deployment, evaluate the selected m APs by the unified objective function:

    OF = a * APE + b * COA    (1)

    APE = (1/m) * sum_{i=1..m} sqrt((x_ei - x_ri)^2 + (y_ei - y_ri)^2)    (2)


where OF is the objective function, APE is the average positioning error, COA is the cost of APs, a and b are their weighting parameters, (x_ri, y_ri) are the coordinates of the ith reference point, and (x_ei, y_ei) are the coordinates of the ith estimated point. Obviously, a smaller objective function value means a better APs placement, i.e., a deployment with less positioning error and APs cost. The average positioning error is obtained through fingerprint positioning, and the method for obtaining the estimated coordinates of the ith reference point is as follows.

1. For each of the m APs, compare its fingerprint with every fingerprint already recorded in the database, i.e., calculate the Euclidean distance between the two fingerprints.
2. Find the 3 fingerprints with the smallest Euclidean distances w1, w2 and w3 in the database, and obtain the coordinates of the 3 corresponding reference points (x1, y1), (x2, y2), (x3, y3).
3. Compute the estimated coordinates of the ith reference point as:

    x_ei = ((x1/w1) + (x2/w2) + (x3/w3)) / ((1/w1) + (1/w2) + (1/w3))    (3)

    y_ei = ((y1/w1) + (y2/w2) + (y3/w3)) / ((1/w1) + (1/w2) + (1/w3))    (4)
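Equations (1)-(4) can be sketched in Python as follows. The function and variable names, the guard against division by zero, and any data fed to these functions are our illustrative assumptions, not the paper's implementation.

```python
import math

def estimate_position(query_fp, db_fps, db_coords):
    """Weighted 3-nearest-neighbor estimate, as in Eqs. (3) and (4)."""
    # Euclidean distance between the query fingerprint and each stored one.
    dists = [math.dist(query_fp, fp) for fp in db_fps]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:3]
    # Inverse-distance weights: a more similar reference point counts more.
    w = [1.0 / max(dists[i], 1e-9) for i in nearest]
    x = sum(db_coords[i][0] * wi for i, wi in zip(nearest, w)) / sum(w)
    y = sum(db_coords[i][1] * wi for i, wi in zip(nearest, w)) / sum(w)
    return x, y

def objective(ref_pts, est_pts, cost, a=1.0, b=0.0):
    """OF = a*APE + b*COA, with APE the mean error of Eq. (2)."""
    ape = sum(math.dist(r, e) for r, e in zip(ref_pts, est_pts)) / len(ref_pts)
    return a * ape + b * cost
```

A query fingerprint identical to a stored one is pulled to that reference point, since its inverse-distance weight dominates; with a = 1 and b = 0 the objective reduces to the average positioning error alone, which matches the fixed-APs setting used later in the experiments.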

where the coordinates of the ith estimated point are calculated from the 3 reference points with the most similar fingerprints, and a more similar reference point has a bigger effect, since the reciprocal of the Euclidean distance is used as the weighting parameter. For the application of indoor positioning, the position of a mobile phone is estimated in the same way. In our work, the mobile phone is supposed to be placed at each AP location one by one; thus the reference points and their estimated points can be used to define the positioning error of the related deployment.

2.2 Implementation of Adaptive PSO

The PSO algorithm easily falls into a local optimum, so we improve the optimization algorithm to increase its breadth-searching ability. The improved PSO is named the Adaptive Particle Swarm (APS) algorithm, which can ensure population diversity and avoid the introduction of a large number of inferior particles. The basic idea of APS is: (1) set up a threshold; (2) take a particle as an excellent particle if its objective function value is no more than the threshold, otherwise take it as an inferior particle; (3) increase the threshold adaptively if the number of excellent particles is too small, or reduce the threshold adaptively if it is too big. The proposed APS is applied to the problem of indoor positioning APs placement optimization, and the procedure is as follows.


1. Initialize the population of particles, and calculate the objective function value of each particle; the specific process is described in Pseudocode 1.
2. Initialize the history optimal value "pbest" for each particle, and initialize the population optimal value "gbest" for all particles, i.e., "gbest" is the least of all "pbest" values in the same generation.
3. Adaptively adjust the particles to obtain a better optimization result; the specific process is described in Pseudocode 2.
4. Update the individual history optimal value "pbest" with the least objective function value for each particle, and the population optimal value "gbest" with the least objective function value over all particles.
5. Take "gbest" as the result if the maximum number of iterations is reached or if "gbest" satisfies the requirement, and take the particle corresponding to "gbest" as the optimal solution of APs deployment; otherwise, go to Step 3.

For each particle, sizepop is the population size, num[i] is the number of APs in the ith particle, and fitness[i] is the objective function value of the ith particle. [Xmin, Xmax] is the x coordinate range of the indoor space, [Ymin, Ymax] is the y coordinate range, and [Vmin, Vmax] is the velocity range of the particles. Based on the coordinates, the corresponding AP is determined, so our method can deal with randomly labeled APs in the indoor space.
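Pseudocode 1 is referenced but not reproduced in this text, so the initialization can only be sketched. The function below fills in plausible details (uniform random positions and velocities inside the stated ranges); every name, the particle layout, and the default behavior are our assumptions.

```python
import random

def init_population(sizepop, ap_count, x_range, y_range, v_range, evaluate):
    """Create sizepop particles; each holds candidate AP coordinates,
    per-coordinate velocities, and the particle's objective value."""
    population = []
    for _ in range(sizepop):
        pos = [(random.uniform(*x_range), random.uniform(*y_range))
               for _ in range(ap_count)]
        vel = [(random.uniform(*v_range), random.uniform(*v_range))
               for _ in range(ap_count)]
        population.append({"pos": pos, "vel": vel, "fitness": evaluate(pos)})
    return population
```

For the parking-lot setting of Sect. 3.2 this would be called with x_range = (4832, 4863), y_range = (5353, 5449) and v_range = (-30, 30), the ranges given there.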


The threshold Tgood is predefined to decide whether a particle is excellent or inferior. The number of excellent particles is goodnum, while Nmax is the maximum allowed number of excellent particles and Nmin is the minimum. The values v1/v2 are defined to decrease/increase the threshold Tgood, and we set v1 > v2 to avoid a large number of inferior particles.
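Since Pseudocode 2 is not reproduced here, the adaptive rule can only be sketched. Treating the objective as minimized, a particle counts as excellent when its value is at most Tgood, and the threshold drifts so that the excellent count stays between Nmin and Nmax; the function name and signature are ours, with defaults taken from the experimental settings.

```python
def adapt_threshold(tgood, fitnesses, nmin=5, nmax=20, v1=0.05, v2=0.03):
    """One adaptive update of the excellence threshold Tgood."""
    goodnum = sum(1 for f in fitnesses if f <= tgood)  # excellent particles
    if goodnum > nmax:
        tgood -= v1  # too many excellent particles: tighten the threshold
    elif goodnum < nmin:
        tgood += v2  # too few excellent particles: relax the threshold
    return tgood
```

Because v1 > v2, the threshold is tightened faster than it is relaxed, which is the device described above for keeping inferior particles from accumulating.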

3 Experimental Results and Analysis

Based on experiments of APs deployment for fingerprint-based indoor positioning, the efficiency of our APS algorithm is analyzed. Our approach is also compared with existing optimization methods, including the Maximum and Minimum Coverage (MMC) based method, the Cramer-Rao Lower Bound (CRLB) based method, the Genetic Algorithm (GA) and the Artificial Immune Algorithm (AIA).

3.1 Testing Environment

In our work, all the algorithms are tested in the positioning space of one underground parking lot, as shown in Fig. 1. Taking iBeacons as APs, there are 107 signal sources.


The APs are evenly deployed in the indoor parking lot, with row spacing and column spacing about 4.5 m.


Fig. 1. Positioning space of underground parking lot with 107 iBeacons

The fingerprint database is acquired as follows: take the 107 AP locations as reference points; at each reference point, use an Android mobile phone oriented in the same direction to collect the received signal strength from each AP; obtain 600 sets of data at each reference point within 1 minute, with one collection every 100 ms; store the collected data of each reference point as one XML file; calculate the average of the 600 sets of data and take it as the fingerprint of the reference point.
Our experiments are executed on a computer with an Intel(R) Core i5-6400 processor (2.70 GHz CPU), 8 GB memory, an NVIDIA GeForce GTX 1050 Ti graphics card, the Win7 64-bit operating system, and VS2015.

3.2 Parameters of Each Algorithm

Based on the positioning space of the underground parking lot, the parameters of our APS algorithm are set as: Xmin is 4832, Xmax is 4863, Ymin is 5353, Ymax is 5449, Vmin is -30 (speed in the negative direction), Vmax is 30 (speed in the positive direction), the initial Tgood is 7.1, Nmin is 5, Nmax is 20, v1 is 0.05, v2 is 0.03, c1 and c2 are 2.0, smin is 0, smax is 10, rxinit is 15.0, rxfinal is 5.0, ryinit is 45.0, ryfinal is 5.0. The parameters of the MMC-based algorithm are the same as in Reference [8], and those of the CRLB-based algorithm are the same as in Reference [9]. The parameters of the GA algorithm are set as: the occurrence probability of the cross operations is 0.5, and the occurrence probability of the mutation operations is 0.2. The parameters of the AIA algorithm are set as: the cross mutation rate is 0.85, the mutation rate of a single gene is 0.65, and the parameter of diversity evaluation is 0.95. For all the above optimization algorithms, the population size is 50, the maximum number of iterations is 80, and the maximum number of APs is 107.
When the number of APs is fixed, the objective function only considers the average positioning error, which is:

    OF = 1.0 * APE + 0.0 * COA    (5)

When the number of signal sources is unfixed, the objective function considers both the average positioning error and the cost of APs, i.e., the number of APs, since they are all iBeacons in our experiments, which is:

    OF = 1.0 * APE + 0.075 * COA    (6)

3.3 Performances with Fixed APs Number

When the number of APs is fixed, all the algorithms are compared considering only the average indoor positioning error for deployment optimization. 88 tests are performed for each algorithm, corresponding to numbers of APs from 20 to 107. As illustrated in Fig. 2, our APS algorithm obtains the minimum location error with the same number of APs as the other methods, and thus gives the best deployment of APs. From the experimental results, the performances of all related algorithms, in ascending order, are: CRLB, MMC, GA, AIA, APS.

Fig. 2. Positioning errors from MMC, CRLB, GA, AIA, APS

3.4 Performances with Unfixed APs Number

When the number of APs is unfixed, all the algorithms are compared considering both the average positioning error and the cost of APs (i.e., the number of iBeacons) for deployment optimization. There are 10 tests for each algorithm, and the results are shown in Fig. 3. Obviously, the APS algorithm obtains the minimum objective function value, i.e., the best integrated evaluation of positioning error and APs cost, and thus gives the best deployment of iBeacons.


Fig. 3. Integrated evaluations of positioning error and APs cost from MMC, CRLB, GA, AIA, APS

The minimum, average and maximum integrated objective function values over the 10 tests are listed in Table 1 for all methods. It can be seen that the proposed APS algorithm achieves the best minimum, average, and maximum integrated objective function values. Based on the average integrated evaluation, the performances of all algorithms, in ascending order, are: MMC, GA, CRLB, AIA, APS.

Table 1. Minimum, average and maximum integrated evaluations from MMC, CRLB, GA, AIA, APS

Algorithm  Minimum integrated evaluation  Average integrated evaluation  Maximum integrated evaluation
MMC        6.52489                        6.52489                        6.52489
CRLB       6.15706                        6.35707                        6.63497
GA         6.25092                        6.38989                        6.50733
AIA        5.88511                        6.09539                        6.38688
APS        5.73439                        5.85493                        5.95164

According to the above experiments in the underground parking lot, our APS algorithm has been proven to be the best optimization method for the spatial deployment of APs for fingerprint based indoor positioning. In the tests with fixed and unfixed numbers of APs, the proposed combination algorithm generates the best result for both single-objective (indoor positioning error) evolution and multi-objective (positioning error and APs cost) evolution.


4 Conclusion

Indoor positioning technology is becoming more and more important, since it brings much convenience to people. With the development of various signal sources, many indoor positioning approaches have been designed. Among existing algorithms, fingerprint positioning is usually used with an established fingerprint database. The efficiency of indoor positioning is affected by the spatial deployment of access points, which should be considered before installing the APs. There are already some algorithms for APs deployment, such as uniform placement, linear programming and nonlinear optimization, and some complex optimization methods have also been used for the deployment of APs, such as GA, AIA, etc.

To help overcome the disadvantages of PSO, such as insufficient searching breadth, we propose a new algorithm, APS. The breadth searching ability of PSO is improved with an adaptive method, which maintains the population diversity and avoids more inferior particles. The APS method has better depth and breadth searching abilities, and works well for APs deployment with single or multiple objectives. Based on a series of experiments with 107 iBeacons in an underground parking lot, our algorithm was tested and compared with the other optimization methods. It has been proven that the proposed APS achieves the best APs deployment with the least indoor positioning error, or the least integrated evaluation considering both positioning error and APs cost.

All the algorithms were tested with fingerprint indoor positioning in an underground parking lot, taking iBeacons as signal sources. In the future, APS will be tested in more positioning spaces, with more types of APs. The optimization algorithm will consider the effects and constraints of different indoor environments, and the complementary advantages of various kinds of APs.
Our ultimate aim is to provide a widely applicable optimization method for the spatial deployment of APs, and to help implement precise indoor positioning in complex spaces with multiple types of APs.

Acknowledgments. This work was supported by the National Key Research and Development Program of China (Project No. 2016YFB0502201).

References

1. Li, C.C., Su, J., Chu, T.H., Liu, J.W.S.: Building/environment data/information enabled location specificity and indoor positioning. IEEE Internet Things J. 4, 2116–2128 (2017)
2. Zou, H., Wang, H., Xie, L., Jia, Q.S.: An RFID indoor positioning system by using weighted path loss and extreme learning machine. In: IEEE International Conference on Cyber-physical Systems, Taipei, Taiwan, pp. 66–71 (2013)
3. Khalajmehrabadi, A., Gatsis, N., Akopian, D.: Modern WLAN fingerprinting indoor positioning methods and deployment challenges. IEEE Commun. Surv. Tutor. 19, 1974–2002 (2017)
4. Chen, K., Wang, C., Yin, Z., Jiang, H., Tan, G.: Slide: towards fast and accurate mobile fingerprinting for Wi-Fi indoor positioning systems. IEEE Sens. J. 18, 1213–1223 (2018)
5. Ma, Y.W., Chen, J.L., Liao, J.J., Tang, C.L.: Intelligent fingerprint-assisted for indoor positioning system. In: IEEE International Workshop on Electromagnetics, vol. 85, pp. 108–109 (2014)


6. Xia, M., Chen, J., Song, C., Li, N., Chen, K.: The indoor positioning algorithm research based on improved location fingerprinting. In: 27th Chinese Control and Decision Conference, Qingdao, China, pp. 5736–5739 (2015)
7. Raspopoulos, M.: Multidevice map-constrained fingerprint-based indoor positioning using 3-D ray tracing. IEEE Trans. Instrum. Meas. 67, 466–476 (2018)
8. Dhillon, S.S., Chakrabarty, K.: Sensor placement for effective coverage and surveillance in distributed sensor networks. In: Wireless Communications and Networking, WCNC, vol. 3, pp. 1609–1614 (2003)
9. Zhou, M., Qiu, F., Xu, K., Tian, Z., Wu, H.: Error bound analysis of indoor wi-fi location fingerprint based positioning for intelligent access point optimization via fisher information. Comput. Commun. 86, 57–74 (2016)
10. Du, X., Yang, K.: A map-assisted wifi AP placement algorithm enabling mobile device's indoor positioning. IEEE Syst. J. 11, 1467–1475 (2017)
11. Chen, X., Zou, S.: Improved wi-fi indoor positioning based on particle swarm optimization. IEEE Sens. J. 17, 7143–7148 (2017)
12. Cai, Y., Guan, W., Wu, Y., Xie, C., Chen, Y., Fang, L.: Indoor high precision three-dimensional positioning system based on visible light communication using particle swarm optimization. IEEE Photonics J. 9, 1–20 (2017)

Deployment Optimization of Indoor Positioning Signal Sources with Fireworks Algorithm

Jianhui Zhao1, Shiqi Wen1, Haojun Ai1,2, and Bo Cai1(✉)

1 School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
[email protected]
2 Collaborative Innovation Center of Geospatial Technology, Wuhan 430079, Hubei, China

Abstract. The spatial deployment of signal sources affects the performance of indoor positioning systems, and has therefore received increasing attention in recent years. This paper presents an FWA method based on the fireworks algorithm to provide the optimal deployment solution. Taking fine chromosomes as fireworks, the explosion factors are set, including the number of explosion sparks and the radius of all explosion sparks. Supplemented individuals are produced by explosion and by random generation, which helps increase the diversity of the population and guarantee the quality of individuals. After crossover and mutation, the population evolves to the next generation. The optimal result of the evolutions is a deployment solution, i.e., a certain number of signal sources with their locations. The FWA algorithm has been shown to have good convergence ability by a series of experiments, with an iBeacons based indoor positioning system in an underground parking lot and the fingerprint based indoor location method. Compared with commonly used optimization algorithms, FWA has the best searching ability in single-objective and multi-objective cases, and it obtains the best optimization result considering only positioning error, or both positioning error and the cost of iBeacons. Therefore, the proposed FWA provides optimal deployment of signal sources for indoor positioning systems.

Keywords: Spatial deploying · Fireworks method · Indoor position · Fingerprint

1 Introduction

Positioning technology can be divided into outdoor positioning and indoor positioning. GPS is the most famous outdoor positioning system, which implements locating by transmitting signals, receiving signal intensity and calculating distances. Due to the irregularity of building structures and the complexity of indoor materials in complicated interiors such as shopping malls, there are different influences on the attenuation of satellite signal intensity. Therefore, people try to install sensors such as Wi-Fi, Bluetooth and LED for indoor positioning. Multiple signal sources, or even different types of them, have been used for indoor locating. Jung and Han presented a WRMs calibration system that automates the initial construction and maintenance of Wi-Fi maps for crowdsourcing based indoor

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 229–238, 2018. https://doi.org/10.1007/978-3-030-05057-3_18


positioning; it uses crowdsourced fingerprints collected from numerous smartphones and incorporates an unsupervised learning algorithm [1]. Chen et al. used both a commodity flashlight and a smartphone to achieve linear positioning, which allows automatic mapping from received signal strength to the position on a line, serving as a building block for fingerprinting in general environments [2]. Popoola and Sinanović designed a low-complexity indoor positioning system whose accuracy is improved using the overlap between LED beams, while collision handling algorithms are designed for LED packets in the overlap region [3]. Zheng et al. proposed an optical indoor positioning scheme using a single LED as beacon and a camera as receiver, where the jointly measured angle of arrival and received light strength are utilized as the fingerprint to determine the position of the receiver [4]. There are other kinds of sensors used as signal sources for indoor positioning, such as ultrasonic [5], ZigBee [6], radio maps [7], etc. Based on the properties of the different kinds of sensors, they can also be utilized together to combine their advantages. Zou et al. implemented an indoor localization and tracking system, using smartphone built-in Inertial Measurement Unit (IMU) sensors, WiFi received signal strength measurements and opportunistic iBeacon corrections based on a particle filter [8]. In case the number of signal sources is fixed, optimal spatial deployment can improve the positioning accuracy effectively. Besides, the optimization technology can help reduce the number of signal sources while maintaining the level of positioning accuracy. How to balance multiple factors such as precision and cost is the main problem of deployment optimization. The initially used spatial deployment is a non-optimization method, i.e., uniform coverage of signal sources, whose core technique is to divide a space evenly.
There are many ways for space dividing [9], e.g., triangulation, trilateration, hyperbolic localization, etc. Uniform coverage is simple, and works well for indoor environments with a regular layout and few obstacles. However, most indoor environments are irregular and complex, and thus are not suitable for uniform coverage. Maximum and minimum coverage [10] is an optimization method, which uses polynomial-time algorithms to determine the number of sensors and their placement, addressing coverage optimization under the constraints of imprecise detection and terrain properties. Compared with non-optimization methods, this scheme can achieve a relatively reasonable deployment of signal sources in complicated indoor spaces. Based on the Cramer-Rao Lower Bound (CRLB) and Simulated Annealing (SA), Zhou et al. designed a method for APs placement, which focuses on the error bound analysis of indoor WiFi fingerprint based positioning for intelligent APs placement optimization, using the Fisher Information Matrix (FIM) to characterize the relationship between positioning errors and signal distributions [11]. There are more complex optimization methods, e.g., Particle Swarm Optimization (PSO), Artificial Immune Optimization (AIO), Genetic Algorithm Optimization (GAO), etc., and some of them have been adopted in spatial location optimization. Chen and Zou presented a Wi-Fi indoor positioning method using an improved unscented Kalman filter, where PSO is proposed to reduce the ranging error and improve the positioning accuracy [12]. Chen et al. predicted the next location of a mobile object based on AIO, taking into account the characteristics of short moving time and an elusive moving tendency [13]. Eldeeb et al. gave a GAO based framework to solve the APs placement

Deployment Optimization of Indoor Positioning Signal Sources


problem, which finds an APs setup with unique fingerprints at each signal test point while maximizing diversity among these fingerprints [14]. The Fireworks Algorithm (FWA) simulates the explosion process of fireworks; it can thus increase the diversity of fireworks while maintaining their quality. Until now, there have been only a few FWA based references [15] for spatial optimization, and no report for indoor positioning applications. The advantages of FWA make it applicable to the spatial deployment of indoor signal sources, and thus an FWA based algorithm is proposed for indoor positioning in this paper.

2 The iBeacons Based Indoor Positioning System

2.1 The iBeacons Based Testing Environment

The testing environment is an underground parking lot, an indoor space of 2,800 m2. As shown in Fig. 1(a), each dot denotes an iBeacon in the space, while the red dots represent an example deployment with a certain number of signal sources and their locations. The target of our work is to find an optimal deployment with fewer iBeacons and better locating accuracy. As shown in Fig. 1(b), the installed iBeacons are labeled with red circles. There are 107 iBeacon signal sources, uniformly arranged in the space. For each iBeacon, the distances to its adjacent signal sources are about 4.5 m. The iBeacons are taken as reference points, and they are used to locate any position in the underground parking lot.

Fig. 1. The iBeacons based testing environment, (a) the layout of indoor positioning space, (b) the installed iBeacons (Color ﬁgure online)


2.2 Fingerprint Based Indoor Location Method

In our work, the fingerprint based positioning approach is adopted, which includes fingerprint database establishment and fingerprint matching. For each reference point, the Received Signal Strength Indicator (RSSI) from every signal source is collected to set up the fingerprint database. During fingerprint matching, the RSSIs of an observing point (any position to be located) are compared with the fingerprints in the database, and the location of the observing point is computed from the most similar reference points.

(1) Fingerprint database establishment
The fingerprint database for n reference points consists of n records, and each record consists of n RSSIs from all signal sources. Thus, there is an n * n matrix in the database, where each record is a fingerprint. If no signal strength can be received, the RSSI is set to zero for the related row and column of the matrix. In our experiments, a mobile phone with Android 5.5 is used to collect RSSIs from all iBeacons. For each reference point, the collecting time is 1 min. The acquisition frequency is once per 100 ms, so a total of 600 sets of 107 RSSIs are obtained. Then the average values are calculated over the 600 sets, and the averaged 107 RSSIs are taken as the fingerprint of the reference point. The fingerprints of all reference points are stored into an XML file, which is the fingerprint database.

(2) Fingerprint matching
To locate an observing point, a fingerprint is collected with n RSSIs from all signal sources. Then the fingerprint is compared with all records in the established fingerprint database. The difference between 2 fingerprints is defined as the Euclidean distance between the related 2 n-dimensional vectors, so the most similar fingerprint is the one with the least Euclidean distance. In our experiments, 3 similar fingerprints are found for every observing point, corresponding to the 3 reference points with the smallest 3 Euclidean distances.
Suppose the 3 Euclidean distances are w1, w2, w3, and the coordinates of the 3 reference points are (x1, y1), (x2, y2), (x3, y3), respectively. Then the coordinates of the observing point are estimated by:

x = (x1∕w1 + x2∕w2 + x3∕w3) ∕ (1∕w1 + 1∕w2 + 1∕w3)

(1)

y = (y1∕w1 + y2∕w2 + y3∕w3) ∕ (1∕w1 + 1∕w2 + 1∕w3)

(2)

The coordinates (x, y) obtained from the above formulas are regarded as the measured location of the observing point. The positioning error of the observing point is evaluated as the Euclidean distance between its measured location and its true location; obviously, a longer distance means a larger error. After all observing points are measured, the average positioning error can be computed. Since the coordinates of all reference points are known in our experiments, the reference points are directly used as observing points. That is, the mobile phone is placed at the location of each reference point, and then its position is measured.
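The database construction and the weighted matching of Eqs. (1)-(2) can be sketched as follows. This is our own minimal illustration, not the authors' code: the function names, the toy RSSI values and the coordinates are all made up, and only two signal sources are used for brevity.

```python
import numpy as np

def build_fingerprint_db(scans_per_point):
    """scans_per_point maps a reference-point id to a (num_scans, n_sources)
    array of RSSI samples; each fingerprint is the per-source mean RSSI."""
    return {rp: np.asarray(s, dtype=float).mean(axis=0)
            for rp, s in scans_per_point.items()}

def locate(observed, db, coords, k=3):
    """Weighted k-nearest-fingerprint positioning per Eqs. (1)-(2):
    weights are the inverses of the k smallest Euclidean distances."""
    obs = np.asarray(observed, dtype=float)
    nearest = sorted((np.linalg.norm(obs - fp), rp) for rp, fp in db.items())[:k]
    x = y = norm = 0.0
    for w, rp in nearest:
        w = max(w, 1e-9)              # guard against an exact zero-distance match
        cx, cy = coords[rp]
        x, y, norm = x + cx / w, y + cy / w, norm + 1.0 / w
    return x / norm, y / norm

# Toy example: 3 reference points, 2 signal sources (hypothetical RSSIs)
scans = {"A": [[-50, -60], [-50, -60]],
         "B": [[-60, -50], [-60, -50]],
         "C": [[-70, -70], [-70, -70]]}
coords = {"A": (0.0, 0.0), "B": (4.5, 0.0), "C": (0.0, 4.5)}
db = build_fingerprint_db(scans)
x, y = locate([-50, -60], db, coords)   # essentially matches reference point A
```

The positioning error of an observing point is then simply the Euclidean distance between the returned (x, y) and the true location.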


(3) Fitness function
To evaluate the positioning results, a fitness function is defined, which may consider only the positioning precision or multiple factors simultaneously. In our system, two factors are considered: positioning error and the cost of signal sources. Because only iBeacons are employed, the cost of signal sources is the number of iBeacons used for indoor positioning. As the number of signal sources increases, the overall positioning error decreases while the system cost increases. How to combine these two factors to achieve a relatively optimal result is the problem to be solved in our system. To represent and evaluate the combination, we adopt the following fitness function:

FFV = a ∗ PE + b ∗ NS (3)

where FFV is the fitness function value, PE is the overall positioning error, NS is the number of signal sources, and a and b are weighting parameters.
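As a minimal illustration of Eq. (3) (our own sketch; the default weights below are hypothetical and merely echo the 1.0/0.075 style of weighting used in the companion objective functions):

```python
def fitness(avg_error_m, n_sources, a=1.0, b=0.075):
    """FFV = a*PE + b*NS, Eq. (3). Setting b = 0 optimizes positioning
    error alone; a small positive b trades error against beacon cost."""
    return a * avg_error_m + b * n_sources
```

With b = 0 the multi-objective problem degenerates to the single-objective (error-only) case, which is how a fixed number of APs is handled.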

3 Fireworks Algorithm for Indoor Positioning

The fireworks algorithm (FWA) is used for the deployment optimization of signal sources, which, to our knowledge, is the first application of FWA in indoor positioning. The optimization procedure is shown in Fig. 2, and its main steps are described as follows.

(1) Initialization of fireworks
In the FWA method there are many fireworks, and each firework consists of some sparks. One firework represents a randomly generated spatial deployment of signal sources, while each spark of a firework represents a signal source. The initialization procedure of FWA is the same as that of the genetic algorithm, i.e., a firework corresponds to a chromosome, and a spark to a gene.

(2) Selection of fine fireworks
For each firework, its fitness function value is calculated. In the evolution procedure of FWA, fine fireworks should increase generation after generation to obtain more optimal results. Therefore, a constant threshold is set on the fitness function value to ensure the convergence of FWA. The values of all fireworks are compared with the threshold, and the fireworks with values below the threshold are selected as fine fireworks. The fine fireworks are then ordered by their fitness function values from small to large.

(3) Setting of explosion factors
For every fine firework, the explosion factors include the number of explosion sparks and the radius of all explosion sparks. The number of explosion sparks of fine firework xi is computed by:


Fig. 2. Flowchart of FWA for indoor positioning

si = m ∗ (ymax − f(xi) + a) ∕ (∑_{i=1}^{n} (ymax − f(xi)) + a) (4)

where m represents the total number of sparks of firework xi, f(xi) represents the fitness function value of firework xi, ymax represents the maximum fitness function value among all m fireworks of the current generation, and the constant a is used to keep the denominator from becoming zero. The radius of the explosion sparks of fine firework xi is computed as:

Ai = A ∗ (f(xi) − ymin + a) ∕ (∑_{i=1}^{n} (f(xi) − ymin) + a) (5)

where A represents the maximum explosion radius set in advance, ymin represents the minimum fitness function value among all m fireworks of the current generation, and the other parameters are the same as in Eq. (4). When a firework explodes, new sparks whose number is given by Eq. (4) are randomly selected within the radius given by Eq. (5), and the new sparks should differ from the old ones.

(4) Supplement of fireworks
Except for the mf fine fireworks, the other mi fireworks of the current generation are discarded since their fitness function values are too large. To make the population size m


unchanged, fireworks need to be supplemented. There are two cases: mi > mf and mi < mf. When mi > mf, the fine fireworks explode using the explosion factors to generate mf new fireworks, and the remaining (mi − mf) fireworks are randomly generated in the same way as in the initialization procedure. When mi

4.5), which is obviously different from the other stages. Moreover, the SS stage can also be differentiated well from the other sleep stages based on the entropy value. However, the difference between the S1 stage and the REM stage is less obvious, and it requires further study.

X. Shao et al.

3.4 Discussion

This study used a psychophysics method to measure the threshold range of sleep stages from single-channel EEG signals by calculating the fuzzy entropy value. On the one hand, we studied the influence of the MFE scale factor on the sleep stage thresholds. The experimental results show that the fuzzy entropy changes with the scale factor; when the scale factor s = 2, the MFE obtains its maximum value and the threshold resolution of the sleep stages is improved (Fig. 5).

Fig. 5. Influence of different scale factors on fuzzy entropy
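A common formulation of multiscale fuzzy entropy (coarse-graining at each scale factor, then fuzzy entropy with an exponential similarity function) can be sketched as follows. This follows the standard definition in the literature (cf. ref. [8]) rather than the authors' exact implementation, and the parameter defaults (m = 2, r = 0.2, n = 2) are conventional choices, not values taken from this paper.

```python
import numpy as np

def fuzzy_entropy(x, m=2, r=0.2, n=2):
    """FuzzyEn(m, n, r) of a 1-D series; r is scaled by the series std."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    tol = r * np.std(x) if np.std(x) > 0 else r
    def phi(mm):
        # mean-removed templates of length mm
        t = np.array([x[i:i + mm] - np.mean(x[i:i + mm]) for i in range(N - mm)])
        # Chebyshev distance between all template pairs
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
        sim = np.exp(-(d ** n) / tol)   # fuzzy similarity degree
        np.fill_diagonal(sim, 0.0)      # exclude self-matches
        return sim.sum() / ((N - mm) * (N - mm - 1))
    return np.log(phi(m)) - np.log(phi(m + 1))

def multiscale_fuzzy_entropy(x, scales=(1, 2, 3), m=2, r=0.2):
    """Coarse-grain the series at each scale factor, then compute FuzzyEn."""
    x = np.asarray(x, dtype=float)
    out = []
    for s in scales:
        seg = len(x) // s
        coarse = x[:seg * s].reshape(seg, s).mean(axis=1)
        out.append(fuzzy_entropy(coarse, m=m, r=r))
    return out
```

Applied to the CEEMDAN-reconstructed EEG epochs, the entropy values at the best scale factor (s = 2 above) would then be compared across sleep stages, as in Fig. 5.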

On the other hand, we studied the gender differences in sleep stage thresholds by comparing the fuzzy entropy thresholds of 10 data samples (5 males and 5 females). A T-test, as shown in Table 1, makes the gender differences in sleep stage thresholds statistically convincing (Sig. (2-tailed) < 0.05).

Table 1. T-test results for different genders (T-test of mean equality; confidence interval q = 95%)

Stage  t      df  Sig. (2-tailed)  Mean difference  Standard error  CI lower  CI upper
W      4.619  8   0.002            0.923            0.199           0.462     1.384
S1     2.983  8   0.018            0.669            0.224           0.151     1.186
S2     3.833  8   0.016            0.539            0.140           0.214     0.863
SS     3.645  8   0.007            0.592            0.162           0.217     0.966
REM    4.559  8   0.004            0.761            0.166           0.376     1.146
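The statistic behind Table 1 is the standard two-sample Student t-test with pooled variance, whose degrees of freedom n1 + n2 − 2 = 8 match the 5 + 5 subjects. A self-contained sketch (our own helper, not the authors' code; the entropy threshold values below are hypothetical):

```python
import math

def two_sample_ttest(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# 5 female vs 5 male thresholds (made-up numbers), df = 8 as in Table 1
t, df = two_sample_ttest([4.9, 5.1, 5.0, 5.2, 4.8], [4.1, 4.3, 4.0, 4.2, 4.4])
```

A positive t with a two-tailed significance below 0.05 (looked up against the t distribution with df = 8) corresponds to the rows of Table 1.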

The experimental results show that there are significant differences in the entropy thresholds of sleep stages between genders, and the sleep thresholds of females are significantly higher than those of males. This may be related to the active areas of the brain in males and females [9], and explanations can be found in psychophysiology. Gur et al. [10] used fMRI to find that, per unit volume of the brain, females have more

A Study of Sleep Stages Threshold Based on Multiscale Fuzzy Entropy


gray matter than males, while males have more white matter than females. In addition, one of the possible reasons for the threshold difference is the social differences between males and females. Of course, this needs further research.

4 Conclusion

In this paper, the CEEMDAN and MFE methods were used to study the threshold of sleep stages based on single-channel EEG signals. First, adaptive EMD decomposition of the EEG data is performed, yielding new high-precision EEG data. Then, the MFE of each new data series is calculated and used as the feature of the sleep stage threshold, which provides a reference for the study of automatic sleep stage classification. Finally, the influence of the fuzzy entropy scale factor and of different gender samples on the sleep stage threshold was studied. The experimental results showed that the sleep threshold of females is significantly higher than that of males.

We will continue our research in the future. On the one hand, our experimental sample size is too small to be universal and representative. On the other hand, the experimental results show that the sleep stage thresholds in the S1 and REM stages cannot be accurately measured using fuzzy entropy. This requires further study to better understand the meaning of sleep.

Acknowledgments. National Natural Science Foundation of China (61373149) and the Taishan Scholars Program of Shandong Province, China.

References

1. Chen, X.: Automatic sleep staging based on EEG. Nanjing University of Posts and Telecommunications (2014)
2. Loomis, W.E., Shull, C.A., Snedecor, G.W.: Methods in Plant Physiology: A Laboratory Manual and Research Handbook. McGraw-Hill, New York City (1937)
3. Shao, X., Hu, B., Zheng, X.: A study on automatic sleep stage classification based on clustering algorithm. In: Zeng, Y., et al. (eds.) BI 2017. LNCS (LNAI), vol. 10654, pp. 139–148. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70772-3_13
4. Tang, Q.: Automatic sleep staging based on EEG signals. Guangdong University of Technology (2016)
5. Cheng, J.: Sleep stage analysis based on EEG signals. Beijing Institute of Technology (2015)
6. Hassan, A.R., Bhuiyan, M.I.H.: Computer-aided sleep staging using complete ensemble empirical mode decomposition with adaptive noise and bootstrap aggregating. Biomed. Signal Process. Control 24, 1–10 (2016)
7. Tiantian, L., Yong, L.: Measurement of thresholds in facial expressions and their age and gender differences. Psychol. Behav. Res. 13(6), 771–777 (2015)
8. Jinde, Z., Minjun, C., Junsheng, C., et al.: Multi-scale fuzzy entropy and its application in fault diagnosis of rolling bearings. J. Vib. Eng. 27(1), 145–151 (2014)


9. Lee, T.M., Liu, H.L., Hoosain, R., et al.: Gender differences in neural correlates of recognition of happy and sad faces in humans assessed by functional magnetic resonance imaging. Neurosci. Lett. 333(1), 13–16 (2002)
10. Gur, R.C., Gunningdixon, F., Bilker, W.B., et al.: Sex differences in temporo-limbic and frontal brain volumes of healthy adults. Cereb. Cortex 12(9), 998–1003 (2002)

Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels

Jing Liu1, Kun Han1, Wenhua Wu1, Shu Wang2, and Xiao Yu3(✉)

1 School of Information and Communication, National University of Defense Technology, Xi'an 710106, China
2 Institute of Systems Engineering, Academy of Military Sciences, Beijing 100039, China
3 School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, Shandong, China
[email protected]

Abstract. The maximum likelihood (ML) estimation algorithm for timing deviation and carrier frequency offset in orthogonal frequency division multiplexing (OFDM) systems is studied, and the ML algorithm is extended to fast-fading multipath wireless channel environments using a multi-symbol joint estimation technique. The method is based on the autocorrelation of cyclic prefixes (CP) in OFDM blocks without training data, so the spectral efficiency and throughput of the system are improved. Meanwhile, for the extreme cases of signal-to-noise ratio, two algorithms are derived which are suboptimal but have lower computational complexity and better channel adaptability. Simulation results indicate that this scheme can effectively improve the estimation performance of symbol timing deviation and carrier frequency offset in fast-fading multipath channels.

Keywords: OFDM · ML estimation · Synchronization · Multipath fading

1 Introduction

OFDM, which has the advantages of high spectral efficiency and the elimination of interference within cells, has recently received great attention and is widely applied to digital audio and video broadcasting systems, indoor broadband wireless systems, etc. [1, 2]. Because subcarriers are separated from one another by utilizing their orthogonality, symbol timing offset and carrier frequency offset have great effects on system performance, such as FFT window offset and inter-carrier interference. By inserting pilot symbols and training sequences, the existing synchronization algorithms provide a simple technology that is often applied to time-varying multipath systems; this technology, however, decreases spectral efficiency and throughput. To improve system performance, blind synchronization technology based on slowly varying channel models has been widely studied, which needs a large quantity of OFDM data blocks and includes cyclic prefix [3, 4] and cyclostationarity [5–7] approaches. This paper proposes a blind synchronization algorithm based on ML which does not need pilot symbols and is suitable for fast-fading multipath OFDM systems.

© Springer Nature Switzerland AG 2018 J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 249–256, 2018. https://doi.org/10.1007/978-3-030-05057-3_20

J. Liu et al.

2 Signal Model

2.1 OFDM System Model

The baseband model of the OFDM system is shown in Fig. 1 [8, 9]. Assume that there are N subcarriers and the transmitted symbols are defined as X(0), X(1), …, X(N − 1). After the inverse fast Fourier transform, the frequency-domain symbols are transformed into time-domain signals x(0), x(1), …, x(N − 1). A cyclic prefix of length D, copied from the last D samples of the time-domain data block, is added to the front of each data block, i.e. x(k) = x(k + N), k ∈ [−D, −1] [10, 11]. The time-domain discrete signal to transmit can be expressed as [12–14]:

x(k) = √(σs²∕N) ∑_{n=0}^{N−1} X(n) e^{j2πnk∕N}, k ∈ [0, N − 1] (1)

where σs² denotes the transmitted energy per symbol, X(n) denotes the symbol with mean 0 and variance 1, and x(k), k ∈ [0, N − 1], denotes the signal with mean 0 and variance σs².
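The modulation of Eq. (1) plus the cyclic prefix rule can be illustrated with a short sketch (our own code, not from the paper; the direct sum is kept deliberately close to the formula rather than written as an FFT call):

```python
import numpy as np

def ofdm_modulate(X, D, sigma_s2=1.0):
    """Build one time-domain OFDM symbol from N frequency-domain symbols X
    via Eq. (1), then prepend the length-D cyclic prefix x(k) = x(k + N)."""
    N = len(X)
    n = np.arange(N)
    x = np.array([np.sqrt(sigma_s2 / N) * np.sum(X * np.exp(2j * np.pi * n * k / N))
                  for k in range(N)])
    return np.concatenate([x[-D:], x])   # CP: x(k) = x(k + N), k in [-D, -1]

rng = np.random.default_rng(0)
X = (rng.standard_normal(64) + 1j * rng.standard_normal(64)) / np.sqrt(2)
sym = ofdm_modulate(X, D=16)   # length N + D = 80
```

The scaling is equivalent to sqrt(sigma_s2 * N) * np.fft.ifft(X), so each time sample has variance sigma_s2 as required by the model.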

Fig. 1. OFDM system model

The signal received from the fast-fading multipath channel can be written as:

y(k) = ∑_{l=0}^{L} h(k, l) x(k − l), (2)

where L + 1 denotes the channel length, which is smaller than the cyclic prefix length D, and h(k, l), l = 0, …, L, is the impulse response of the channel. The correlation of h(k, l) can be expressed as:

E{h(k1, l1) h∗(k2, l2)} = γ J0(2πfD T |k1 − k2|∕N) e^{−l∕D}, l = l1 = l2, (3)

where γ is a normalization constant, J0(·) is the first kind Bessel function, fD denotes the maximum Doppler shift, and T is the valid symbol time. The received signal can be expressed as:

r(k) = y(k − θ) e^{j2πkε∕N} + w(k), (4)

where θ denotes the discrete delay during signal transmission, ε is the normalized carrier frequency offset, and w(k) is AWGN with variance σw².

2.2 Signal Correlation

The cyclic prefix in a transmitted OFDM symbol is the data block copied from the last D data samples. We define the set I as the received samples corresponding to the cyclic prefix and I∗ as the original data copied into the cyclic prefix, so that the data in I correspond to those in I∗. For k ∈ I we have:

E{r(k) r∗(k + m)} =
  γ ∑_{l=0}^{L} J0(0) e^{−l∕D} σs² + σw² = β0 σs² + σw²,  m = 0,
  γ ∑_{l=0}^{L} J0(2πfD T) e^{−l∕D} σs² e^{j2πε} = β1 e^{j2πε} σs²,  m = N,
  0,  otherwise, (5)

where β0 = γ ∑_{l=0}^{L} J0(0) e^{−l∕D} and β1 = γ ∑_{l=0}^{L} J0(2πfD T) e^{−l∕D}. For k ∉ I and m ≠ 0, E{r(k) r∗(k + m)} = 0. The SNR is defined as β0 σs² ∕ σw².


.

Optimal Estimated Value Based on ML

In the last section, the symbol to send and channel noise is assumed to be a complex Gaussian signal. For k ∈ I, the joint probability density function (pdf) of the received signal can be expressed as: { }) |r(k)|2 + |r(k + N)|2 − 2𝜌Re ej2𝜋𝜀 r(k)r∗ (k + N) exp − ) ( (1 − 𝜌2 ) 𝛽0 𝜎s2 + 𝜎w2 f (r(k), r(k + N)|𝜃, 𝜀 ) = ( )2 𝜋 2 (1 − 𝜌2 ) 𝛽0 𝜎s2 + 𝜎w2 (

(6)

Where weighting coeﬃcient ρ is deﬁned as: 𝛽1 𝜎 s |E{r(k)r∗ (k + N)}| 𝜌= √ = √ { 2 2 } } { 𝛽 𝜎 0 s + 𝜎w E |r(k)|2 E |r(k + N)|2 2

For all k, pdf of the received signal can be written as:

(7)

252

J. Liu et al.

(

|r(k)|2

exp − ( ) 𝛽0 𝜎s2 + 𝜎w2 f (r(k)|𝜃, 𝜀 ) = ( ) 𝜋 𝛽0 𝜎s2 + 𝜎w2

) (8)


By using the vector form of the received signal, the log-likelihood function can be written as:

log f(r|θ, ε) = log( ∏_{k∈I} f(r(k), r(k + N)|θ, ε) ∏_{k∉I∪I∗} f(r(k)|θ, ε) )
 = log( ∏_{k∈I} [ f(r(k), r(k + N)|θ, ε) ∕ ( f(r(k)|θ, ε) f(r(k + N)|θ, ε) ) ] ∏_{k} f(r(k)|θ, ε) ),

(9)


where f(·) is the marginal pdf of the corresponding random variables. From (8), we can find that f(r(k)|θ, ε) does not depend on θ or ε. Assuming that the number of received data blocks is M, we have:

I = {θ, …, θ + D − 1, θ + K, …, θ + K + D − 1, …, θ + (M − 1)K, …, θ + (M − 1)K + D − 1},

(10)

Where K = N + D the total length of one OFDM data block. By substituting (9), (10) and (12) into (11), we have M−1 𝜃+D−1

$$
\log f(\mathbf{r}\,|\,\theta, \varepsilon) = C_1 + C_2 \sum_{i=0}^{M-1} \sum_{k=\theta}^{\theta+D-1} \Bigl[ \mathrm{Re}\bigl\{ e^{j2\pi\varepsilon}\, r(k+iK)\, r^*(k+iK+N) \bigr\} - \frac{\rho}{2} \bigl( |r(k+iK)|^2 + |r(k+iK+N)|^2 \bigr) \Bigr],
\tag{11}
$$

where

$$
C_1 = \sum_{k=\theta}^{\theta+D-1} \log(1-\rho^2), \qquad
C_2 = \frac{2\rho}{(1-\rho^2)\left(\beta_0\sigma_s^2 + \sigma_w^2\right)}.
\tag{12}
$$

By transforming (11), we have

$$
\log f(\mathbf{r}\,|\,\theta, \varepsilon) = \left| T_1(\theta) \right| \cos\bigl( 2\pi\varepsilon + \angle T_1(\theta) \bigr) - \frac{\rho}{2}\, T_2(\theta),
\tag{13}
$$

where ∠ denotes the phase of a complex number.

Blind Estimation Algorithm Over Fast-Fading Multipath OFDM Channels


$$
T_1(\theta) = \sum_{i=0}^{M-1} \sum_{k=\theta}^{\theta+D-1} r(k+iK)\, r^*(k+iK+N),
\tag{14}
$$

$$
T_2(\theta) = \sum_{i=0}^{M-1} \sum_{k=\theta}^{\theta+D-1} \bigl( |r(k+iK)|^2 + |r(k+iK+N)|^2 \bigr).
\tag{15}
$$

T1(θ) can be seen as the self-correlation of the signal, and T2(θ) is an accumulated energy function [15]. The ML estimates of θ and ε can be computed by maximizing (13). We can compute these values in two steps as follows:

$$
\max_{\theta, \varepsilon} \log f(\mathbf{r}\,|\,\theta, \varepsilon)
= \max_{\theta} \max_{\varepsilon} \log f(\mathbf{r}\,|\,\theta, \varepsilon)
= \max_{\theta} \log f(\mathbf{r}\,|\,\theta, \varepsilon_{\mathrm{ML}}(\theta)).
\tag{16}
$$

where ε lies within [0, 2π]. The ML estimate of ε is

$$
\varepsilon_{\mathrm{ML}}(\theta) = \frac{1}{2\pi} \angle T_1(\theta).
\tag{17}
$$

By substituting the estimate (17) into (13), the ML estimate of θ is

$$
\theta_{\mathrm{ML}} = \arg\max_{\theta}\ \left| T_1(\theta) \right| - \frac{\rho}{2}\, T_2(\theta).
\tag{18}
$$

From (17) and (18), we can see that the variance of the estimates depends on the number of OFDM data blocks M, the length of the cyclic prefix D, and the weighting coefficient ρ. Meanwhile, the variance of T1(θ) determines the performance of the estimation algorithm: when θML = θ, |T1(θ)| attains its largest value. From (18), the weighting coefficient ρ must be computed according to the current state of the channel. When the SNR is very high, β₀σs² ≫ σw²; substituting this into (7) gives ρ → 1. Substituting ρ → 1 into (18) yields an estimate of θ that corresponds to an MMSE-like algorithm. It can be written as:

$$
\theta_{\mathrm{MMSE}} = \arg\max_{\theta}\ \left| T_1(\theta) \right| - \frac{1}{2}\, T_2(\theta).
\tag{19}
$$

When the SNR is very low, β₀σs² ≪ σw²; substituting this into (7) gives ρ → 0. Substituting ρ → 0 into (18) yields an estimate of θ that corresponds to an MC-like algorithm. It can be written as:

$$
\theta_{\mathrm{MC}} = \arg\max_{\theta}\ \left| T_1(\theta) \right|.
\tag{20}
$$

From (19) and (20), we can see that the computational complexity of these two algorithms is very low and that both adapt well to the channel. For different values of SNR, we can then obtain the estimation algorithm for ε given in (17). Note that the performance of the ε estimate depends on the estimation result for θ.
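The two-step estimator can be sketched end-to-end on a toy signal. All sizes below are illustrative, not the paper's simulation parameters; the channel is noiseless and single-path, so ρ → 1 and the high-SNR variant (19) applies, followed by (17) for the carrier offset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy OFDM stream: M blocks of N samples, each prefixed by a cyclic prefix of
# length D (a copy of the block's last D samples). Sizes are illustrative.
N, D, M, theta_true, eps_true = 32, 8, 4, 5, 0.1
K = N + D

blocks = []
for _ in range(M):
    data = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    blocks.append(np.concatenate([data[-D:], data]))      # CP + data
head = (rng.standard_normal(theta_true) + 1j * rng.standard_normal(theta_true)) / np.sqrt(2)
tail = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
s = np.concatenate([head] + blocks + [tail])

# Carrier offset applied so that r(k) r*(k+N) carries the phase e^{j 2 pi eps},
# matching the sign convention of Eq. (5). Noiseless for clarity.
idx = np.arange(len(s))
r = s * np.exp(-1j * 2 * np.pi * eps_true * idx / N)

def T1(r, theta):
    # Eq. (14): correlation between cyclic prefix and its copy
    return sum(r[k + i * K] * np.conj(r[k + i * K + N])
               for i in range(M) for k in range(theta, theta + D))

def T2(r, theta):
    # Eq. (15): accumulated energy
    return sum(abs(r[k + i * K]) ** 2 + abs(r[k + i * K + N]) ** 2
               for i in range(M) for k in range(theta, theta + D))

# High-SNR estimator (19) for the timing, then Eq. (17) for the offset
theta_hat = max(range(K), key=lambda t: abs(T1(r, t)) - 0.5 * T2(r, t))
eps_hat = np.angle(T1(r, theta_hat)) / (2 * np.pi)
```

At the true offset the metric |T1(θ)| − T2(θ)/2 is zero (the CP pairs are perfectly correlated), while at any other candidate the energy term dominates and the metric is strongly negative, so the search recovers θ and then ε exactly in this noiseless setting.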

4

Results and Analysis

The performance of the proposed ML-based algorithm is evaluated by the Monte Carlo method in this section. The parameters are set as in [16]: 20 symbols are transmitted in the OFDM system, N = 128, T = 224 us, L = 20, M = 20, D/N = 1/4, and fD = 1 kHz. The gain of each channel tap independently follows the same Gaussian distribution. The timing deviation is θ = 50 and the frequency offset is ε = 0.1. The variance curves of the timing deviation estimate versus SNR and versus multipath length for the traditional MMSE algorithm, the MC algorithm, and the proposed ML-based algorithm are shown in Figs. 2 and 3. From Fig. 2 we can see that the performance of all three algorithms improves as the SNR increases. From Fig. 3 we can see that as the multipath length increases, the estimation performance of all three algorithms degrades with the decreasing channel quality. Among the three algorithms, MC performs worst but has the lowest computational complexity.

Fig. 2. Timing deviation estimation vs. SNR

Fig. 3. Timing deviation estimation vs. multipath length


The variance curves of the frequency offset estimate versus SNR and versus multipath length for the traditional MMSE algorithm, the MC algorithm, and the proposed ML-based algorithm are shown in Figs. 4 and 5. From Fig. 4 we can see that the performance of all three algorithms improves as the SNR increases and that all three show the same adaptability to SNR. From Fig. 5 we see that the performance of all three algorithms degrades as the multipath length increases and that MC performs worst. From Figs. 2, 3, 4 and 5, we find that the performance of the presented algorithm can meet the requirements of a real system under certain conditions. Compared with other blind estimation algorithms, these three ML-based algorithms achieve much better performance.

Fig. 4. Frequency offset estimation vs. SNR

Fig. 5. Frequency offset estimation vs. multipath length

5

Conclusions

An estimation algorithm based on ML is proposed to solve the problem of blind synchronization over fast-fading multipath channels. Two suboptimal estimation algorithms with low computational complexity are derived for different SNR regimes. The presented algorithms work without training data and can therefore increase spectral efficiency and throughput. Simulation results indicate that the presented algorithms improve the performance of symbol timing deviation and carrier offset estimation.

References

1. Lin, T.C., Phoong, S.M.: A new cyclic-prefix based algorithm for blind CFO estimation in OFDM systems. IEEE Trans. Wireless Commun. 15(6), 3995–4008 (2016)
2. Fang, C., Gong, X., Huang, M.: On sequential blind channel estimation for time-varying OFDM systems. In: IEEE International Conference on Ubiquitous Wireless Broadband, pp. 1–4 (2016)
3. Prakash, D., Pillai, S.S., Jayaprakash, A., Reddy, G.R.: A new blind carrier frequency offset estimation scheme for OFDM systems. In: International Conference on Communication & Signal Processing, pp. 1096–1100 (2016)
4. Lin, T.C., Pan, Y.C., Tai, W.J., Phoong, S.M.: An improved ESPRIT-based blind CFO estimation for OFDM in the presence of I/Q imbalance. Signal Process. Adv. Wireless Commun. 395(6), 639–643 (2013)
5. Sun, Z., Liu, R., Wang, W.: Joint time-frequency domain cyclostationarity-based approach to blind estimation of OFDM transmission parameters. EURASIP J. Wireless Commun. Network. 2013(1), 1–8 (2013)
6. Zhang, W., Gao, F., Yao, B.: Blind CFO estimation for multiuser OFDM uplink with large number of receive antennas. In: IEEE International Conference on Acoustics, vol. 64(9), pp. 2255–2268 (2016)
7. Lim, J.: Joint estimation of CFO and channel in OFDM systems with blind noise statistics. IETE Tech. Rev., 1–13 (2016)
8. Liu, M., Li, B., Yang, Q., Tang, N.: Blind joint estimation for OFDM time-frequency parameters. Circuits Syst. Signal Process. 32(6), 2999–3012 (2013)
9. Liu, M., Li, B.: Bandwidth blind estimation for OFDM. In: IEEE International Conference on Digital Signal Processing, pp. 181–184 (2017)
10. Li, X., Hu, J., Wei, H., Yu, F., Wang, G.: Blind carrier and sampling frequency offsets estimation in OFDM system. In: Wireless Communications & Networking Conference, pp. 1–6 (2017)
11. Saci, A., Al-Dweik, A., Shami, A., Iraqi, Y.: One-shot blind channel estimation for OFDM systems over frequency-selective fading channels. IEEE Trans. Commun. 65(12), 5445–5458 (2017)
12. Jayaprakash, A., Reddy, G.R.: Robust blind carrier frequency offset estimation algorithm for OFDM systems. Wireless Pers. Commun. 94(3), 1–15 (2017)
13. Tian, J., Zhou, T., Xu, T., Hu, H., Li, M.: Blind estimation of channel order and SNR for OFDM systems. IEEE Access PP(99), 1 (2018)
14. Wang, Y.C., Phoong, S.M.: Blind estimation of symbol timing offset in OFDM systems. In: IEEE International Workshop on Signal Processing Advances in Wireless Communications, pp. 1–5 (2017)
15. Ramadhan, M., Bouzidi, D.A., Iyad, D.: A low complexity joint semi-blind estimation of CFO and channel for OFDM systems. In: International Conference on Electrical Engineering - Boumerdes, pp. 1–6 (2017)
16. Lin, T.C., Phoong, S.M.: MSE-optimized CP-based CFO estimation in OFDM systems over multipath channels. In: Asia-Pacific Signal & Information Processing Association Summit & Conference, pp. 818–822 (2018)

Facial Shape and Expression Transfer via Non-rigid Image Deformation

Huabing Zhou¹, Shiqiang Ren¹, Yong Zhou²(B), Yuyu Kuang¹, Yanduo Zhang¹, Wei Zhang¹, Tao Lu¹, Hanwen Chen¹, and Deng Chen¹

¹ Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China
² Yangtze University College of Technology and Engineering, Jingzhou 434100, China
[email protected]

Abstract. In this paper, we present a novel approach for transferring the shape and expression of a face in one image to that of another, regardless of variance between the two faces in illumination, color, texture, resolution and even some mild occlusion. We first use a face alignment algorithm to locate accurate facial landmark points for both the original face and the target face, then align them with a global similarity transformation to eliminate their inconsistency in pose, size and position. Finally, we use our non-rigid image deformation method to deform the original face by fitting a map function for each of its pixel points according to the two sets of facial landmark points. By combining a face alignment algorithm and a non-rigid image deformation method, our method can be fully automatic, or semi-automatic for conveniently tuning a better result. Experimental results show that our method can produce realistic, natural and artifact-less facial shape and expression transfer. We also discuss the limitations and potential of our proposed method.

Keywords: Non-rigid image deformation · Face editing · Expression transfer

1

Introduction

Image deformation, which refers to deforming objects into desired shapes or poses, has long been an active research area in image processing. Specifically, face deformation aims at deforming faces to obtain new face images with an expected shape or expression. It has a number of useful applications ranging from face image beautification to medical imaging and facial animation in the entertainment industry. It remains challenging, however, because the human face has an extremely complex geometric form and movement mechanism, as well as subtle variations in color and texture.

The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China under Grant Nos. 41501505, 61502354 and the Scientific Research Project of the Education Department of Hubei Province under Grant No. Q20181508.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 257–269, 2018. https://doi.org/10.1007/978-3-030-05057-3_21


Many works have tried to tackle this problem in different ways. In image blending based methods [6,9,18], to transfer the expression from one face to another, the target face with the expected expression is cut and pasted onto the original face, followed by a seamless blending [5]. These methods can create quite realistic expression transfer when the two faces have similar color and texture, but they may change the identity of the original face and are not robust to inconsistencies in illumination, color, texture, resolution or occlusion. Morph-based approaches [3,19] synthesize new expressions between two different facial expressions through interpolation; one limitation of these methods is that they cannot transfer facial expressions between different people.


Fig. 1. Three examples of face shape and expression transfer with our face deformation method. The left column shows the original faces, i.e., the faces to be deformed; the middle column shows the results of transferring facial expression and shape from the target faces to the original faces with our method; the right column shows the target faces, which contain the expected facial expression and shape.

Image deformation methods are one of the common ways of dealing with these troubles. Shen et al. [22] achieve face smilization by using an image deformation method to deform a normal face into a smiling one. Deformation methods view face deformation as a mapping from the original face to the deformed face, and solving the mapping function relies only on face contour information, which means deformation methods can achieve facial shape and expression transfer between different people regardless of variances in illumination, color, texture, resolution and even occlusion [25,27].


Fig. 2. Framework of our method. First, landmark points are extracted from the original and target faces; then the landmark points of the target face are aligned to those of the original face with a global similarity transformation; finally, the original face is deformed according to the original face landmark points and the aligned target face landmark points with our non-rigid image deformation method.

Many image deformation methods have been proposed to satisfy requirements such as intuitive user interaction and realistic deformation results. Among them, the methods that avoid unnatural local scaling and shearing are of special interest. To produce such deformations, Schaefer [21] proposed Moving Least Squares (MLS) [8] using linear functions such as rigid transformations. The use of MLS and rigid transformations makes the deformation as-rigid-as-possible [1]. However, the deformation methods mentioned above are modeled for the deformation of general objects; the special geometrical structure features of the face are not taken into account. These geometrical structure features can provide intrinsic structure information of the original face, which is beneficial to the face deformation estimation. Therefore, we need to develop a non-rigid model. To address these issues and produce more realistic and natural face deformation, we propose a new algorithm based on MLS and a non-rigid transformation modeled [10–13,15] by specifying it in a reproducing kernel Hilbert space (RKHS) [2,14]. Furthermore, taking the special geometrical structure features of the face into account, we introduce a local neighborhood structure constraint into our model as a regularization term. Benefiting from the combination of these factors, our algorithm can avoid superfluous global or local deformation and leads to more natural and realistic face deformation. Specifying the transformation in an RKHS leads to a simple closed-form solution which is computationally efficient.

In general, image deformation methods are typically controlled with a set of distinct handles, including points [4], lines [3], and even polygon grids [17], which are usually chosen by users manually. In our face deformation method, we use the facial landmark points as handles. Benefiting from the impressive progress in real-time face detection and facial landmark alignment in recent years [7,20,23,24], we use a face alignment algorithm [7] to locate accurate facial landmarks rather than choosing them manually, which makes it possible to achieve automatic facial expression and shape transfer by using one face image to drive the deformation of another. As shown in Fig. 1, there are three examples of facial expression transfer using our face deformation method, in which the expression and shape of the target faces are transferred to the original faces. Our contributions in this paper include the following two aspects. First, we propose a novel non-rigid model with local neighborhood structure regularization to deal with face deformation, which can capture the intrinsic geometry of the input face image and hence helps to produce realistic deformation. Second, combining the face alignment algorithm, we present a fast and automatic approach to facial expression and shape transfer with our face deformation method.

2

Facial Expression and Shape Transfer

The framework of our method is shown in Fig. 2. To achieve automatic facial expression and shape transfer between two faces, we first utilize a cascade regression tree based face alignment algorithm [7] to extract 68 accurate face landmark points from the two face images. This algorithm is quite robust to face images of sculptures, sketches, comics and paintings. Then, to eliminate the inconsistencies between the two faces in position, size and rotation, we align the target landmarks to the original landmarks with a similarity transformation, which can be solved with an ordinary Procrustes analysis. After the transformation, the target face landmarks have a similar size, pose and location to those of the source face and, at the same time, retain the contour details which are of vital importance for the face deformation. Finally, we transfer facial expression and shape from the target face to the original face by deforming the original face to the shape of the target face with our non-rigid face deformation method.

Let $X = \{x_i\}_{i=1}^n$ be the original face landmark points and $Y = \{y_i\}_{i=1}^n$ be the aligned target face landmark points, where $x_i$ and $y_i$ are column vectors representing the coordinates of the $i$-th landmark point and $n$ is the number of points in the two sets. The deformation can be viewed as a map $f$ from the original image to the deformed image; each pixel point $p$ in the original image has a unique map function $f_p$:

$$
f_p(p) = p + g_p(p)
\tag{1}
$$

where $g_p$ is the displacement function, which is solved through the interpolation of X and Y, and $f_p(p)$ is the coordinate in the deformed image to which p is mapped. More details of the deformation will be discussed in the following sections. We can see that all the color and texture features of the deformed image are mapped from the original image and only landmark features are from the target face image; this leads to the fact that our method can avoid artifacts caused by inconsistency of the two faces in illumination, color, texture and resolution.
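The similarity alignment step above (ordinary Procrustes analysis) can be sketched as follows. The four-point square and the synthetic scale/rotation/translation are illustrative stand-ins for real 68-point landmark sets:

```python
import numpy as np

def procrustes_align(src, dst):
    """Similarity transform (scale s, rotation R, translation t) minimizing
    ||s * src @ R.T + t - dst||^2 (ordinary Procrustes analysis).
    src, dst: (n, 2) landmark arrays."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    A, B = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(B.T @ A)          # cross-covariance of centered sets
    d = np.sign(np.linalg.det(U @ Vt))         # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (A ** 2).sum()
    t = mu_d - s * mu_s @ R.T
    return s, R, t

# Usage: align "target" landmarks Y back onto "original" landmarks X
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = 0.3
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = 2.0 * X @ Rot.T + np.array([3.0, -1.0])    # scaled, rotated, shifted copy
s, R, t = procrustes_align(Y, X)
Y_aligned = s * Y @ R.T + t                    # matches X up to rounding
```

Because the target set here is an exact similarity copy of the source, the recovered scale is exactly the inverse (0.5) and the aligned landmarks coincide with X; with real landmark sets the fit is least-squares rather than exact.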


3


Local Feature Guided Non-rigid Image Deformation

In this section, we describe the details of our non-rigid face deformation algorithm. As mentioned above, the deformation is built according to two sets of distinct points. Let X be the set of control points (original face landmarks) and Y be the corresponding target points (aligned target face landmarks). We view the deformation as a function f that maps the points in the original image to those in the deformed image, and formulate the function estimation as a vector-field interpolation that should satisfy the following three properties [21]: (i) Interpolation: the points $\{x_i\}_{i=1}^n$ should map directly to $\{y_i\}_{i=1}^n$ under deformation; (ii) Smoothness: f should produce smooth deformations; (iii) Identity: f should be the identity function if the deformed handles $\{y_i\}_{i=1}^n$ are the same as $\{x_i\}_{i=1}^n$ (i.e., ∀i, x_i = y_i ⇒ f(x) = x, with x being an arbitrary point in the image). These properties are very similar to those used in scattered data interpolation. Thus, we construct a non-rigid deformation function f satisfying these three properties with a closed-form solution.

3.1

Problem Formulation

The mathematical formulation of the deformation problem is based on Moving Least Squares (MLS) [21]. For each point p in the image, MLS is used to solve for a rigid-body transformation $f_p(x)$ that minimizes a weighted least squares error functional:

$$
\sum_{i=1}^{n} w_i(p)\, \bigl\| f_p(x_i) - y_i \bigr\|^2
\tag{2}
$$

where $w_i(p)$ is a non-negative weight function defined as

$$
w_i(p) = \| p - x_i \|^{-2\alpha}
\tag{3}
$$

where α controls the weight of each control point and ∥·∥ is the Euclidean distance. The global deformation function f is obtained from a set of local functions and is defined as f(p) = f_p(p), which is continuously differentiable. As the traditional MLS method models the deformation with rigid transformations for general objects, to specialize it to face deformation we generalize the formulation to the non-rigid case and take the special geometrical structure features of the face into account. To generalize this formulation to the non-rigid case, we first replace the deformation function in the MLS method with a non-rigid one. As mentioned in Eq. (1), we model the non-rigid displacement function $g_p(p)$ by requiring it to lie within a specific functional space, namely a reproducing kernel Hilbert space (RKHS). We define an RKHS by a positive definite matrix-valued kernel Γ : IR² × IR² → IR^{2×2} [16], and here we choose a diagonal decomposable kernel:

$$
\Gamma(x_i, x_j) = e^{-\| x_i - x_j \|^2 / \beta^2}\, I
\tag{4}
$$

with β determining the width of the range of interaction between points and I the identity matrix.
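The weight function (3) and the scalar part of the kernel (4) transcribe almost directly to numpy; the sample points below are arbitrary:

```python
import numpy as np

def mls_weights(p, X, alpha=2.0):
    """Eq. (3): w_i(p) = ||p - x_i||^(-2*alpha); closer control points weigh more."""
    d = np.linalg.norm(X - p, axis=1)
    return d ** (-2.0 * alpha)

def gram_matrix(X, beta=2.0):
    """Gaussian Gram matrix built from the scalar factor of kernel (4)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / beta ** 2)

X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
G = gram_matrix(X)                           # symmetric with unit diagonal
w = mls_weights(np.array([1.0, 1.0]), X)     # largest weight at the nearest point
```

Note the weights diverge as p approaches a control point, which is exactly what forces the interpolation property at the landmarks.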


The optimal displacement function $g_p$ then takes the form:

$$
g_p(x) = \sum_{i=1}^{n} \Gamma(x, x_i)\, c_i
\tag{5}
$$

where the coefficient $c_i$ is a 2 × 1 vector (to be determined).

To take advantage of the geometrical structure features of the face, we introduce a local neighborhood structure regularization, for the local structures among neighboring feature points are very strong and stable. This is particularly beneficial for non-rigid facial movement. Therefore, we preserve the local neighborhood structure with a local geometrical constraint during the deformation. In our deformation problem, we hope that the local structures in Y can be preserved after the displacement of X, which can be achieved by the following three steps [26]. First, search for the k nearest neighbors of each point in X, and enforce the weight $M_{ij} = 0$ if $x_j$ does not belong to the set of neighbors of $x_i$, where M is an n × n neighboring weight matrix with $M_{ij}$ summarizing the contribution of $x_j$ to the reconstruction of $x_i$. Second, minimize the reconstruction errors measured by the cost function

$$
E(M) = \sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{n} M_{ij}\, x_j \Bigr\|^2
\tag{6}
$$

under the constraint that the rows of the weight matrix sum to one: $\sum_{j=1}^{n} M_{ij} = 1$. The optimal neighboring weights $M_{ij}$ can be obtained by solving a least squares problem. Third, the local geometry of each control point after the transformation f is preserved by minimizing the cost function

$$
\sum_{i=1}^{n} w_i(p) \Bigl\| x_i + g_p(x_i) - \sum_{j=1}^{n} M_{ij} \bigl( x_j + g_p(x_j) \bigr) \Bigr\|^2.
\tag{7}
$$

Combining the moving least squares error term in Eq. (2) and the local regularization term in Eq. (7), the optimal displacement function $g_p$ can be solved by minimizing

$$
\sum_{i=1}^{n} w_i(p) \bigl\| x_i + g_p(x_i) - y_i \bigr\|^2
+ \eta \sum_{i=1}^{n} w_i(p) \Bigl\| x_i + g_p(x_i) - \sum_{j=1}^{n} M_{ij} \bigl( x_j + g_p(x_j) \bigr) \Bigr\|^2
\tag{8}
$$

where the positive real number η controls the tradeoff between the two terms. With a closed-form solution for the coefficient set C, we define our deformation function as the initial position plus the displacement function:

$$
f(p) = p + (\Gamma_p C)^T
\tag{9}
$$


where the kernel vector $\Gamma_p = (\Gamma(p, x_1), \ldots, \Gamma(p, x_n))$ has size 1 × n and the coefficient matrix $C = (c_1, c_2, \ldots, c_n)^T$ has size n × 2. Note that this deformation function f is smooth, and as p approaches $x_i$, $w_i(p)$ approaches infinity and the function interpolates, i.e., $f(x_i) = y_i$. Moreover, if ∀i, $x_i = y_i$, then $g_p(p) \equiv 0$ and f is the identity transformation, i.e., f(p) = p.

3.2

Closed-Form Solution

By substituting Eq. (5) into Eq. (8), the objective can be rewritten in the following matrix form:

$$
E(C) = \bigl\| W^{1/2} (X + \Gamma C - Y) \bigr\|_F^2
+ \eta\, \bigl\| W^{1/2} \bigl( X + \Gamma C - M (X + \Gamma C) \bigr) \bigr\|_F^2
\tag{10}
$$

where the kernel matrix $\Gamma \in \mathrm{IR}^{n \times n}$ is called the Gram matrix with $\Gamma_{ij} = e^{-\|x_i - x_j\|^2/\beta^2}$, the weight matrix W is a diagonal matrix with the i-th entry determined by Eq. (3), X and Y are the control points and target points respectively, in which the i-th rows represent $x_i$ and $y_i$, C is the coefficient matrix of size n × 2, and $\|\cdot\|_F$ denotes the Frobenius norm. Equation (10) is quadratic in C. Taking its derivative with respect to C and setting it to zero, we obtain a closed-form solution:

$$
C = (I + \eta Q W^{-1})^{-1} \Gamma^{-1} Y - \Gamma^{-1} X
\tag{11}
$$

where I is the identity matrix and $Q = (I - M)^T W (I - M)$. With this closed-form solution for C, we can write a simple expression for the deformation function:

$$
f(p) = p + \bigl[ \Gamma_p \bigl[ (I + \eta Q W^{-1})^{-1} \Gamma^{-1} Y - \Gamma^{-1} X \bigr] \bigr]^T
\tag{12}
$$

where $\Gamma_p$ is a row vector with the i-th entry $\Gamma_{p,i} = e^{-\|p - x_i\|^2/\beta^2}$. To deform a new face image more efficiently, we approximate the original face image with a grid, apply the deformation function (12) to each vertex, and then use bilinear interpolation in each quad of the grid. We summarize our approach in Algorithm 1.

3.3

According to solution Eq. (12), the computation complexity is mainly determined by time complexity of solving the weight matrix M , Gram matrix Γ and the inversion of a matrix of size n × n. To search the k nearest neighbors for each point in X, the time complexity should be close to O((k + n) log n) by using the kd tree [20]. According to Eq. (6), the time complexity of obtaining the weight matrix M is O(k 3 n) because each row of M can be solved separately with O(k 3 ) time complexity. Due to the Gram matrix being of size n×n, the time complexity of solving the Γ is O(n2 ). Since weight matrix M and Gram matrix Γ share for deformation function of each point, namely they only need to be computed once, while inversion

264

H. Zhou et al.

Algorithm 1. The Proposed Algorithm

1 2 3 4 5 6 7 8 9 10 11 12

Input: original and target face, kernel Γ , parameters k, α, β, η Output: Deformed face Extract face landmark points and get the correspondences {xi , yi }n i=1 ; Construct the Gram matrix Γ based on {xi }n i=1 ; Search the k nearest neighbor for each point in X; Compute M by minimizing the cost function (6); Approximate the original face image with a grid; repeat Choose a vertex p on the grid; Compute the weight W by Eq. (3); Compute the vector Γp ; Compute f at vertex p by using Eq. (12); until all the vertexes are computed ; The deformed face is generated by a bilinear interpolation of {f (p)}.

of a matrix of size n × n in solution (13) are diﬀerent for each point in the image, the total complexity of our method is O(k 3 n + n2 + n3 l). Since k n l, and it can be written as O(n3 l), where l is the number of vertex in the grid which is used for approximating the image. Moreover, users in general creates the deformations by manipulating the target point set, and the control points are ﬁxed. Therefore, much of Eq. (12) can be precomputed. In particular, we can rewrite Eq. (12) in the form: (13) f (p) = S + (V Y )T where V = Γp (I + ηQW −1 )−1 Γ −1 and S = p − (Γp Γ −1 X)T can be precomputed leading to fast implementation. In this case, the time complexity of our algorithm is reduced to O(nl). Parameter Setting: There are mainly four parameters in our method: k, α, β and η. Parameter k controls the number of nearest neighbors for local neighborhood structure regularization. Parameter α controls the weight of each control points. Parameter β and η aﬀect the amount of the local structure constraint, β determines how wide the range of interaction between points, η determines the tradeoﬀ between the MLS error term and the local structure regularization term. We ﬁnally set k = 15, α = 2, β = 2 and η = 10 according to the parameter tuning experiments.

4

Experiment

In this section, we test our method on diﬀerent types of face images. We use the dlib library to implement the cascade regression tree based face alignment algorithm [7], and extract the landmark points for both original and target face images. To demonstrate our method, we conduct the experiment based on a self-organized dataset which include various kind and style of face images. More exactly, the

Facial Shape and Expression Transfer via Non-rigid Image Deformation

265

dataset include face image of men, women, children and the style range from image of nature face, sculpture, sketch, comic and painting. The face images vary in factors such as illumination, color, texture, resolution and occlusion. Here, we present some representative types of face deformation. In Fig. 3, we show 4 representative facial expression and shape transfer results obtained with our method. To evaluate the performance of our method, we also report the results of MLS [21] and blending based face swap(face blending) [9] method as comparison. In the ﬁgure, the ﬁrst row presents original faces and the ﬁfth row presents target faces, while the second, third and fourth rows are the corresponding facial expression and shape transfer (from target faces to original faces) results of our method, MLS and face swap method.

Original f aces

Ours

M LS

F ace blinding

T arget f aces

Fig. 3. Face expression and shape transfer results of our method, MLS [21] and face blending method [9]. the first row: original face, the second row: results of our method, the third row: results of MLS, the fourth row: results of face blending method, the fifth row: target faces

266

H. Zhou et al.

First column shows facial expression and shape transfer between two head sculpture images, we can see that both deformation method (our method and MLS) and face swap method have their advantages and produce natural and smooth results. The result of face swap method shows more signiﬁcant facial expression and shape transfer since more counter detail of target face are transferred, but it tend to change the identity of result face in the same time; our method retains most counter detail of original face and achieve transfer with smooth deformation; MLS method performs similar to our method but slightly poor in some details, e.g. unnatural curving in jaw and mouth. In the second column, we consider transfer facial expression and shape from an image of natural face to a face painting. We can see that the result of face swap method occurs obvious blur; the blur is caused by its blending operation which aim at eliminating the inconsistent in color, texture and revolution, while our method are not aﬀected by these inconsistence and produce natural, smooth and clear deformation result, and for MLS method, the unexpected zigzag again appear on the lips. To further explore the performance of the three method, we consider facial expression and shape transfer between two face image with more signiﬁcant diﬀerence in expression and factors such as illumination, color, texture and occlusion. As shown in the third and fourth column, both MLS and face swap method degenerate. From the third column of the ﬁgure, we can see that all the three method can transfer facial expression and shape in a large degree, but face swap method yield relatively poor results due to the obvious artifacts caused by the inconsistence of the two face images in color and collusion of glasses in the target face. For the result of MLS, there are some imperfections such as unnatural curving in the jaw and defects in the right brow, while result of our method are smooth and natural. 
From the fourth column of the ﬁgure, we can see that there are lots of ﬂaw in the result of face swap method due to the inconsistence of the two face in color and texture; in the result of MLS, there are unnatural curving in jaw, mouth and brow; moreover, fold-over, another unexpected property of MLS, appears between left eye and brow. Our method is not troubled by the above problems which show that our method can achieve more natural and smooth face deformation and it’s quite robust to the inconsistence of original face and target face. In previous examples, we deform the entire face, namely transfer facial shape and expression at same time; however, our proposed method can only deform part of face according to your needs. Figure 4 shows two examples of transfer external face shape and internal facial expression separately by choosing diﬀerent landmark points. The ﬁrst column presents original faces and the fourth column presents target faces, the second column shows the results of internal expression transfer and the third column shows the results of external shape transfer. Figure 5 shows two example of generating facial animation with our method by using diﬀerent face to drive a static face image deformation. Images in the left are original face images, in the top are target face images and in the bottom are deformed face image. The results show that the deformation is smooth, natural and artifact-less despite the signiﬁcant variances between original face image and target face images.

Facial Shape and Expression Transfer via Non-rigid Image Deformation


Fig. 4. Two examples of transferring external face shape and internal facial expression separately. First column: original faces; second column: results of internal expression transfer; third column: results of external shape transfer; fourth column: target faces.

Fig. 5. Two examples of facial animation generated by our method. Left: original faces; top: target faces; bottom: facial animations.

H. Zhou et al.

5 Discussion and Future Work

As mentioned above, our method relies entirely on the original and target face landmark points, in both their accuracy and their number. As a result, inaccurate face landmark localization will severely degrade our results, and too few landmark points will lose contour detail of the target face, which further leads to an insignificant deformation effect. Another limitation of our method is the deformation of the mouth and eyes; e.g., if we try to deform a mouth or eye from closed to open, there will be a hole, since there is no information about these regions in the original image. We plan to use a generative model to solve this problem in future work. A potential extension of our non-rigid deformation method is that it can be applied to the deformation of 3D point-cloud objects. We will try to generalize our method to the 3D case in future work.

6 Conclusion

In this paper, we present a novel approach that transfers the shape and expression of a face in one image to that of another, regardless of variance between the two faces in factors such as illumination, color, texture, resolution, and even mild occlusion. By combining a face alignment algorithm with a non-rigid image deformation method, our approach can run fully automatically, or semi-automatically for conveniently tuning a better result. The final results are realistic, natural, and artifact-free.

References

1. Alexa, M., Cohen-Or, D., Levin, D.: As-rigid-as-possible shape interpolation. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 157–164 (2000)
2. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
3. Beier, T., Neely, S.: Feature-based image metamorphosis. ACM SIGGRAPH Comput. Graph. 26(2), 35–42 (1992)
4. Bookstein, F.L.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. 11(6), 567–585 (1989)
5. Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH, pp. 313–318 (2003)
6. Garrido, P., Valgaerts, L., Rehmsen, O., Thormaehlen, T., Perez, P., Theobalt, C.: Automatic face reenactment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4217–4224 (2014)
7. Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Computer Vision and Pattern Recognition, pp. 1867–1874 (2014)
8. Levin, D.: The approximation power of moving least-squares. Math. Comput. 67(224), 1517–1531 (1998)
9. Liu, L., Liu, L., Nie, X., Feng, J., Yan, S., Yan, S.: A live face swapper. In: ACM on Multimedia Conference, pp. 691–692 (2016)


10. Ma, J., Zhao, J., Tian, J., Bai, X., Tu, Z.: Regularized vector field learning with sparse approximation for mismatch removal. Pattern Recognit. 46(12), 3519–3532 (2013)
11. Ma, J., Zhao, J., Tian, J., Yuille, A.L., Tu, Z.: Robust point matching via vector field consensus. IEEE Trans. Image Process. 23(4), 1706–1721 (2014)
12. Ma, J., Zhao, J., Guo, H., Jiang, J., Zhou, H., Gao, Y.: Locality preserving matching. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4492–4498. AAAI Press (2017)
13. Ma, J., Zhao, J., Jiang, J., Zhou, H.: Non-rigid point set registration with robust transformation estimation under manifold regularization. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4218–4224 (2017)
14. Ma, J., Zhao, J., Tian, J.: Nonrigid image deformation using moving regularized least squares. IEEE Signal Process. Lett. 20(10), 988–991 (2013)
15. Ma, J., Zhao, J., Tian, J., Tu, Z., Yuille, A.L.: Robust estimation of nonrigid transformation for point set registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2154 (2013)
16. Ma, J., Zhao, J., Tian, J., Yuille, A.L., Tu, Z.: Robust point matching via vector field consensus. IEEE Trans. Image Process. 23(4), 1706–1721 (2014)
17. Maccracken, R., Joy, K.I.: Free-form deformations with lattices of arbitrary topology. In: Conference on Computer Graphics and Interactive Techniques, pp. 181–188 (1996)
18. Min, F., Sang, N., Wang, Z.: Automatic face replacement in video based on 2D morphable model. In: International Conference on Pattern Recognition, pp. 2250–2253 (2010)
19. Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., Salesin, D.H.: Synthesizing realistic facial expressions from photographs (1998)
20. Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at 3000 fps via regressing local binary features. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1685–1692 (2014)
21. Schaefer, S., Mcphail, T., Warren, J.: Image deformation using moving least squares. ACM Trans. Graph. 25(3), 533–540 (2006)
22. Shen, S., Yamasaki, T., Aizawa, K., Sugahara, T.: Data-driven geometric face image smilization featuring moving least square based deformation. In: IEEE Third International Conference on Multimedia Big Data, pp. 220–225 (2017)
23. Xiao, S., Yan, S., Kassim, A.A.: Facial landmark detection via progressive initialization. In: IEEE International Conference on Computer Vision Workshop, pp. 986–993 (2015)
24. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 94–108. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_7
25. Zhou, H., Kuang, Y., Yu, Z., Ren, S., Dai, A., Zhang, Y., Lu, T., Ma, J.: Non-rigid image deformation algorithm based on MRLS-TPS. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 2269–2273. IEEE (2017)
26. Zhou, H., Ma, J., Yang, C., Sun, S., Liu, R., Zhao, J.: Nonrigid feature matching for remote sensing images via probabilistic inference with global and local regularizations. IEEE Geosci. Remote Sens. Lett. 13(3), 374–378 (2016)
27. Zhou, H., Ma, J., Zhang, Y., Yu, Z., Ren, S., Chen, D.: Feature guided non-rigid image/surface deformation via moving least squares with manifold regularization. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 1063–1068. IEEE (2017)

P-Schedule: Erasure Coding Schedule Strategy in Big Data Storage System

Chao Yin, Haitao Lv(✉), Tongfang Li, Yan Liu, Xiaoping Qu, and Sihao Yuan

Jiujiang University, Jiujiang 332005, China
[email protected]

Abstract. Erasure coding technology is one of the key technologies in big data storage systems. A well-designed erasure code can not only improve the reliability of a big data storage system, but also greatly improve its performance. Most existing big data storage systems use the replica strategy, which provides good availability and real-time access but causes a lot of data redundancy and wasted storage space. A large part of the data stored in such systems exists in the form of cold data. In this paper, we target the cold data in big data storage systems, which does not have high requirements on availability and real-time access. We propose a scheme that supports both the replica strategy and the coding strategy, and we design the corresponding node scheduling and data addressing schemes. We select the Liberation code, which excels at write operations, and develop the P-Schedule scheme to optimize the decoding speed. Through this series of designs, we can effectively improve the disk utilization and write speed for cold data in the big data system. The test results show that the sequential write performance of erasure coding is better than that of the replica strategy: the larger the data block, the better the performance.

Keywords: Big data · Erasure coding · Liberation · P-Schedule

1 Introduction

Since the birth of the Internet, data has grown in an explosive way [1]. Especially in recent years, the development of mobile terminals, networking, and cloud computing has made data grow faster and faster. The "China Mobile Phone Market Research Report in 2016–2017" shows that Internet users in China had reached 668 million by June 2016, of whom 593 million were mobile phone users. According to the results report released by Tencent Inc. in 2017, WeChat's monthly active users have reached 549 million. The more data there is, the more important it becomes; the recent popularity of deep learning [2] and big data [3, 4] illustrates this very well. Traditional centralized storage cannot handle such a large amount of data, and big data storage has come into being in this context [5]. Combining the advantages of storage systems and network systems, big data storage can provide better reliability, security, and scalability. To guarantee the reliability of the data, a big data system needs some kind of backup scheme to avoid data loss when a node is damaged [6]. Existing big data systems usually adopt the replica scheme [7], which causes a lot of wasted storage space.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 270–279, 2018. https://doi.org/10.1007/978-3-030-05057-3_22


Traditional RAID [8] uses erasure coding for storage; RAID6 in particular uses erasure coding to keep data redundancy at a very low level for the same degree of protection. There are many erasure codes similar to RAID6, such as RDP [9, 10], EVENODD [11], STAR [12], and so on. However, because of their complexity, their impact on data availability, and the large amount of network bandwidth they occupy during encoding and decoding, large-scale erasure coding schemes have not been introduced into big data systems. We propose an erasure coding scheme to relieve the storage pressure and preserve data availability for cold data storage. This scheme can greatly reduce data redundancy. Considering that cold data usually occupies a large proportion of the system, the scheme focuses on optimizing the sequential write operations of cold data and greatly improves write performance. The contributions of this paper are as follows:

1. After comparing and analyzing the advantages and disadvantages of the replica strategy and the erasure code strategy in big data storage systems, we propose to solve the problem of cold data wasting a lot of storage space through the application of erasure codes.
2. We propose to improve the cold data storage capacity of big data storage systems by using the Liberation code scheme, and we have designed and implemented erasure codes in a fault-tolerant big data storage system.
3. Through detailed comparative experiments, we verify the feasibility and practicability of erasure coding for storing cold data in big data storage systems; it can greatly reduce data redundancy without affecting data availability.

The rest of this paper is organized as follows. Section 2 presents related works. Section 3 introduces the theory of the P-Schedule algorithm and Sect. 4 its implementation. The experimental results and evaluation are described in Sect. 5. Section 6 concludes.

2 Related Works

2.1 Coding Based on Matrix

Matrices are the key component of erasure coding. Both array codes and RS codes are encoded and decoded in matrix form. Coding based on a matrix is essentially a dot product of matrices and vectors. Suppose there are k data blocks and m check blocks, and each block contains w bits. The matrix contains k + m rows and k columns, and each of its elements is an integer in the finite field GF(2^w) [13]. The matrix is called the distribution matrix (DM for short). Different codes have different coding matrices. Multiplying the distribution matrix by the vector of data blocks yields a vector containing both the data blocks and the check blocks. When encoding, the check blocks are obtained as the dot product of the erasure matrix and the data vector.
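The encode/decode procedure described above can be sketched in a few lines if we work over GF(2) instead of GF(2^w), so that the dot product reduces to AND/XOR. The distribution matrix, block values, and helper names below are purely illustrative, not the paper's actual code:

```python
# Illustrative distribution-matrix encode/decode over GF(2) (w = 1).
# Real systems use GF(2^w); the matrix below is a toy example.

def matvec(m, v):
    """Dot product of a bit matrix and a bit vector over GF(2)."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in m]

def invert_gf2(m):
    """Invert a square bit matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [x ^ y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Distribution matrix for k = 3 data blocks and m = 2 check blocks:
# identity rows on top, parity rows below (hypothetical parity pattern).
DM = [[1, 0, 0],
      [0, 1, 0],
      [0, 0, 1],
      [1, 1, 0],
      [0, 1, 1]]

data = [1, 0, 1]
blocks = matvec(DM, data)   # three data blocks followed by two check blocks

# Decoding: suppose blocks 0 and 1 are lost. Take the DM rows of any k
# surviving blocks, invert that submatrix, and multiply by their values.
survivors = [2, 3, 4]
sub = [DM[i] for i in survivors]
recovered = matvec(invert_gf2(sub), [blocks[i] for i in survivors])
```

With this toy matrix, `recovered` equals the original data vector, which is exactly the recovery property the section describes.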


We know that each data block corresponds to a row of the distribution matrix. When decoding, we only need to find the rows corresponding to the k undamaged data blocks in the distribution matrix to form an erasure matrix. This matrix is inverted and multiplied by the vector of undamaged blocks, and the damaged data blocks can be computed and repaired.

2.2 Coding Based on Bit-Matrix

If we expand the (k + m) × k distribution matrix by a factor of w in both the row and column directions over the finite field GF(2^w), we get a matrix with w(k + m) rows and wk columns. We call this matrix the BDM (Binary Distribution Matrix), and its bottom part, consisting of m rows of w × w submatrices, is called the CDM (Coding Distribution Matrix). We extend the vector mentioned in the previous section to wk elements. Since each element of the matrix and of the vector is a single bit, we can use XOR operations instead of dot product operations [15]: the data-vector bits corresponding to the 1s in a BDM row are XORed together. By replacing the original dot product operations with XOR operations, the speed of encoding and decoding can be greatly improved. Moreover, since the number of XOR operations is directly related to the number of 1s in the BDM, we can judge the performance of an encoding by the number of 1s in its BDM.

3 P-Schedule Scheme

3.1 The Principle of P-Schedule Scheme

Figure 1 shows a bit-matrix encoding procedure for k = 3 and w = 5. Let us analyze the calculation steps in the coding process of this matrix. The most direct way is to perform 5 dot product operations and then convert these dot product operations into XOR operations.

Fig. 1. The encoding process based on Bit-matrix.


However, the bit-matrix is a sparse matrix. Compared with performing the encoding operations directly, it is more efficient to preprocess them. We use five-tuples to represent the encoding process, as in Eq. 1:

⟨op, sd, sb, dd, db⟩    (1)

We can use XOR operations instead of dot product operations, XOR-ing together the data-vector bits corresponding to the 1s in the BDM. Replacing the original dot product operations with XOR operations greatly improves the speed of encoding and decoding, and the performance of the encoding is determined by the number of 1s in the BDM. In the tuple, op represents the operation type: 0 is a copy operation and 1 is an XOR operation. sd is the device number of the source data and sb is the bit number of the source data; dd and db represent the device number and bit number of the destination data, respectively. For convenience, we number the devices from 0 to k + m − 1: an ID i < k indicates the data device Di, and an ID i ≥ k indicates the check device Ci−k. The parity computation of the bit-matrix in Fig. 2 can be expressed as a schedule, as shown in Table 1. A schedule can effectively reduce the number of XOR operations, so whenever we encode or decode in a bit-matrix coding system, we should convert the operations into a schedule to improve efficiency.
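A schedule of such five-tuples can be executed by a very small interpreter. The devices, word length, and the schedule itself below are illustrative only (they do not reproduce the matrix of Fig. 2):

```python
# Minimal interpreter for five-tuple schedules <op, sd, sb, dd, db>:
# op 0 copies a source bit, op 1 XORs it into the destination bit.

def run_schedule(schedule, devices):
    """devices: list indexed by device number, each a list of w bits."""
    for op, sd, sb, dd, db in schedule:
        if op == 0:
            devices[dd][db] = devices[sd][sb]      # copy
        else:
            devices[dd][db] ^= devices[sd][sb]     # XOR accumulate
    return devices

# k = 3 data devices with w = 3 bits each, plus one check device (index 3).
devices = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
schedule = [
    (0, 0, 0, 3, 0),   # c0,0  = d0,0
    (1, 1, 1, 3, 0),   # c0,0 ^= d1,1
    (1, 2, 2, 3, 0),   # c0,0 ^= d2,2
]
run_schedule(schedule, devices)
# devices[3][0] now holds d0,0 XOR d1,1 XOR d2,2
```

The copy operation seeds each parity bit, so no zero-initialization pass over the check devices is strictly needed in a full schedule.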

Fig. 2. An example of bit-matrix encoding for k = 3 and w = 5.

Table 1. Schedule for bit-matrix operation.

C0,0 = d0,0 ⊕ d1,1 ⊕ d2,2
C0,1 = d0,1 ⊕ d1,2 ⊕ d2,3
C0,2 = d0,2 ⊕ d1,2 ⊕ d1,3 ⊕ d2,4
C0,3 = d0,3 ⊕ d1,4 ⊕ d2,0
C0,4 = d0,4 ⊕ d1,0 ⊕ d2,0 ⊕ d2,1

3.2 Encoding

In the coding system, there are k data devices and m check devices, each with a word length of w bits. Usually, a matrix with w(k + m) rows and wk columns in the finite field GF(2^w) is used as the erasure matrix. We select several representative erasure codes to check the performance of their erasure matrices, such as Liberation, EVENODD, RDP, and Cauchy Reed-Solomon. For any given w and k, the number of 1s in the parity part of the Liberation code matrix is kw + k − 1, and the number in the identity matrix at the head of the BDM is kw, so the total number of 1s in the erasure matrix is 2kw + k − 1. We know that if an erasure matrix contains x ones, the number of XOR operations in the encoding process is x − 1. To obtain one parity bit, the number of XOR operations required by the Liberation code is shown in Eq. 2:

(2kw + k − 1 − 2w) / (2w) = (k − 1) + (k − 1) / (2w)    (2)

The optimal value of Eq. 2 is k − 1. To obtain one check bit, the number of XOR operations in the EVENODD code is shown in Eq. 3:

(kw + (w − 1)(k − 1) + kw − 2w) / (2w) = (3/2)(k − 1) − (k − 1) / (2w)    (3)
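As a quick sanity check of Eq. 2 (as reconstructed here), the per-parity-bit XOR count on the left simplifies to the closed form on the right for any sample k and w:

```python
# Eq. 2 check: (2kw + k - 1 - 2w) / (2w) == (k - 1) + (k - 1) / (2w).
def xors_per_parity_bit(k, w):
    return (2 * k * w + k - 1 - 2 * w) / (2 * w)

k, w = 10, 7
lhs = xors_per_parity_bit(k, w)
rhs = (k - 1) + (k - 1) / (2 * w)   # approaches the optimum k - 1 as w grows
```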

As we can see, the number of XOR operations in the EVENODD code is almost one and a half times that of the Liberation code. In addition, we can use the proportion of 1s in the parity matrix to compare the various codes: it is 16% for the Liberation code, 28% for RDP, and 21% for CRS. From this comparison, the Liberation code has the fewest 1s in its erasure matrix, which means we can complete the coding with fewer XOR operations in practice. Therefore, the coding scheme used in this paper is based on the Liberation code.

3.3 Decoding

Suppose that the data on data nodes D0 and D1 is missing, with k = 5 and w = 5. To recover the missing data, we take the ten rows of the inverted erasure matrix that correspond to D0 and D1; from these rows and the remaining data nodes we can compute the data on D0 and D1. The number of 1s in these ten rows is 134, so the number of XOR operations is 134 − 10 = 124. Now let us check the number of 1s in the zeroth and fifth rows, which are used to calculate d0,0 and d1,0, respectively. The zeroth row contains 16 ones and the fifth row 14, which means 28 XOR operations are required by the direct matrix computation. We have found thirteen columns in which both of these rows contain a 1. If we calculate d1,0 first, it needs only 13 XOR operations, and we can then calculate d0,0 from d1,0:

d0,0 = d1,0 ⊕ d2,0 ⊕ d3,0 ⊕ d4,0 ⊕ p0    (4)


Equation 4 requires only 4 XOR operations, so the total number of XOR operations is reduced from 28 to 17 in this way.
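The saving in this example is plain XOR-count arithmetic; the helper functions below are illustrative, not the paper's implementation:

```python
# XOR-count comparison for the decoding shortcut described above.
def independent_cost(ones_a, ones_b):
    # evaluate two matrix rows separately: (ones - 1) XORs each
    return (ones_a - 1) + (ones_b - 1)

def reuse_cost(ones_b, extra_terms):
    # recover the cheaper word first, then derive the other via Eq. 4
    return (ones_b - 1) + extra_terms

naive = independent_cost(16, 14)   # direct evaluation of both rows
smart = reuse_cost(14, 4)          # d1,0 first, then d0,0 from Eq. 4
```

With the row weights from the text (16 and 14 ones, 4 extra terms), `naive` is 28 and `smart` is 17.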

4 Implementation

4.1 Architecture

Our system is based on Linux, and its original backup strategy is replication. In this paper, we add erasure coding on top of the multi-replica mechanism. The system framework is shown in Fig. 3.

Fig. 3. The architecture of the system.

The cluster management module manages the system nodes and maintains membership among them. For example, when a node fails or a new node joins the system, it informs the upper layer and performs the operations needed to keep the membership information consistent across nodes. When an application initiates a read/write request, the local gateway receives the request and locates the server node holding the data block through the consistent hashing algorithm. If the block is on another server node, the request is forwarded to that node. If it is on the local node, the requested data is handled either in replica form or in erasure-code form. The node management module receives the request from the gateway and performs the read/write operation according to the request type.

4.2 Node Schedule

To avoid bottlenecks, our system adopts a symmetric, decentralized ring architecture. A consistent hashing algorithm is used to locate the storage node for a file; there is no super node or metadata server, and all nodes have equal status. Virtual nodes are distributed evenly on the ring by hashing their IP addresses and port numbers. When selecting a storage group, we hash the client's user name, select the first virtual node after the hash value, and then the next N nodes. If a candidate node is on the same physical node as an already selected one, it is skipped and the next one is chosen; the selection continues until N nodes are chosen. As Fig. 4 shows, suppose we choose three nodes to form a storage group. After hashing the client's user name, the first node obtained is A-1. We then walk the ring to select the next two nodes, B-1 and A-2. Since node A-2 and node A-1 are on the same physical node, we give up A-2 and continue traversing to find C-1. At this point A-1, B-1, and C-1, located on three different physical nodes, form the storage group.
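The selection walk described above can be sketched as follows. The hash function (MD5) and the virtual-to-physical node mapping are assumptions for illustration, not the system's actual choices:

```python
import hashlib

def ring_pos(name):
    """Place a name on the hash ring (illustrative: MD5 of the name)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def select_group(virtual_nodes, user, n):
    """virtual_nodes: dict virtual-node name -> physical-node name.

    Walk clockwise from the hash of the user name, skipping virtual
    nodes whose physical node is already in the group (as in Fig. 4).
    """
    ring = sorted(virtual_nodes, key=ring_pos)
    start = ring_pos(user)
    # first virtual node at or after the user's ring position (wrap to 0)
    idx = next((i for i, v in enumerate(ring) if ring_pos(v) >= start), 0)
    group, used = [], set()
    for i in range(len(ring)):
        v = ring[(idx + i) % len(ring)]
        if virtual_nodes[v] not in used:
            group.append(v)
            used.add(virtual_nodes[v])
        if len(group) == n:
            break
    return group

vnodes = {"A-1": "A", "A-2": "A", "B-1": "B", "B-2": "B", "C-1": "C"}
group = select_group(vnodes, "alice", 3)
# group contains three virtual nodes on three distinct physical nodes
```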


Fig. 4. The relationship among data blocks, nodes, and hash space.

This selection scheme avoids placing multiple members of a storage group on the same physical node, which would otherwise expose the group to multiple simultaneous faults, and it also avoids failing to find an appropriate storage group.

5 Evaluation

5.1 Experimental Setup

The test environment consists of three hardware servers, on which a big data storage cluster is built with virtual machines for testing. The hardware configuration of each server is shown in Table 2. The most commonly used indicator for measuring the read and write performance of a storage system is IOPS, and all tests in this paper use it as the test standard. In addition, we use fixed gradient values for the size of data blocks: 4 K, 64 K, 512 K, and 1024 K, respectively. In the read and write performance tests, we test the big data storage system in different backup modes: triple replication and different erasure codes. To ensure the same fault-tolerance level as replication, we set the ratio of k to m to 4:2 in the erasure code mode.

Table 2. The parameters of the test server

Name        | Parameter
CPU         | Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz × 2
Memory      | Qimonda 1333 MHz 4 GB × 2
System Disk | WDC WD1003FBYX-0 1 TB 7200 rpm
Data Disk   | ST1000DM003-9YN1 1 TB × 4, 7200 rpm
SSD         | Seagate 600 SSD 120 G MLC

5.2 Read Performance Tests

Read operations can be divided into sequential reads and random reads according to the location of each read. A sequential read starts at a certain location and reads onward until some end position. A random read selects a random location, reads a small amount of data, then jumps to another random location and continues reading. In theory, sequential reading is much faster than random reading, especially in systems that use disks as the storage medium.

Fig. 5. IOPS in reading.

In Fig. 5, the horizontal axis represents the size of the data block, and the vertical axis represents the sum of the IOPS values of the nodes. As we can see, IOPS decreases overall as the data block grows, because the time to read each block increases with the block size, lowering the overall IOPS of the system. In addition, for the same block size, the sequential-read IOPS of the erasure code strategy is lower than that of the replica mode, and the smaller the data block, the greater the IOPS gap between the two modes. When the block size is 4 K, the value in the erasure mode is about 25% lower than in the replica mode; when the block size is 1024 K, it is only about 10.4% lower.

5.3 Write Performance Tests

Figure 6 shows the IOPS of sequential writes. The overall trend is again that the IOPS value becomes smaller as the data block becomes larger. However, unlike read operations, the IOPS of the erasure code strategy is larger than that of the replica strategy for write operations; moreover, as the data block grows, the erasure code IOPS approaches twice that of the replica mode. This is because, for the same write request, the amount of data written under the replica strategy is much higher than under the erasure code strategy. For each 1 K write request, the replica strategy writes 3 K of data and consumes 2 K of network bandwidth, while the erasure code strategy only needs to write 1.5 K of data and consumes 1.25 K of network bandwidth.
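The write-cost arithmetic in this paragraph follows directly from the redundancy factors; the helper names are illustrative:

```python
# Storage written per 1 K client write: triple replication stores 3 copies;
# a k = 4, m = 2 erasure code stores (k + m) / k = 1.5x the data.
def replica_write(data_kb, copies=3):
    return data_kb * copies

def erasure_write(data_kb, k=4, m=2):
    return data_kb * (k + m) / k

stored_replica = replica_write(1)   # 3 K stored per 1 K request
stored_ec = erasure_write(1)        # 1.5 K stored per 1 K request
```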

Fig. 6. IOPS in writing.

6 Conclusion

Erasure coding is an effective way to reduce data redundancy: it can achieve the same fault tolerance with far less redundancy than the replica strategy. However, a big data system using only the erasure code strategy lacks the data availability and real-time access required by users. Considering the massive waste of storage space caused by cold data, this paper presents a dedicated schedule strategy for erasure-coded storage of cold data. We propose to improve the cold data storage capacity of big data storage systems by using the Liberation code. The experiments show that erasure coding can greatly reduce data redundancy without affecting data availability. Decoding in an erasure-coded system consumes a large amount of network bandwidth, which is another factor restricting the use of erasure codes in big data storage systems. Although regenerating codes have been proposed to mitigate the network bandwidth problem, they do so by sacrificing storage efficiency. How to optimize the decoding bandwidth without sacrificing storage efficiency is also a direction for future research.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 61662038), the Science and Technology Project of the Jiangxi Provincial Department of Education (No. GJJ151081), the Visiting Scholar Funds of the China Scholarship Council, and the Jiangxi Association for Science and Technology.


References

1. Morris, R.J.T., Truskowski, B.J.: The evolution of storage systems. IBM Syst. J. 42(2), 205–217 (2003)
2. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R., Muharemagic, E.: Deep learning applications and challenges in big data analytics. J. Big Data 2(1), 1–21 (2015)
3. Schermann, M., Hemsen, H., Buchmüller, C., Bitter, T., Krcmar, H., Markl, V., Hoeren, T.: Big data. Bus. Inf. Syst. Eng. 6(5), 261–266 (2014)
4. Chen, Y., Chen, H., Gorkhali, A., Lu, Y., Ma, Y., Li, L.: Big data analytics and big data science: a survey. J. Manag. Anal. 3(1), 1–42 (2016)
5. Li, S., Cao, Q., Wan, S., Qian, L., Xie, C.: HRSPC: a hybrid redundancy scheme via exploring computational locality to support fast recovery and high reliability in distributed storage systems. J. Netw. Comput. Appl. (2015)
6. Calder, B., Wang, J., Ogus, A., et al.: Windows Azure storage: a highly available cloud storage service with strong consistency. In: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 143–157 (2011)
7. Chun, B.G., Dabek, F., Haeberlen, A., et al.: Efficient replica maintenance for distributed storage systems. In: Proceedings of NSDI, pp. 225–264 (2006)
8. Chen, P.M., Lee, E.K., Gibson, G.A., et al.: RAID: high-performance, reliable secondary storage. ACM Comput. Surv. 26(2), 145–185 (1994)
9. Corbett, P., English, B., Goel, A., et al.: Row-diagonal parity for double disk failure correction. In: FAST 2004: Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pp. 1–14 (2004)
10. Xiang, L., Xu, Y., Lui, J., et al.: Optimal recovery of single disk failure in RDP code storage systems. In: SIGMETRICS 2010: Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 119–130 (2010)
11. Blaum, M., Brady, J., Bruck, J., et al.: EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Comput. 44(2), 192–202 (1995)
12. Huang, C., Xu, L.: STAR: an efficient coding scheme for correcting triple storage node failures. IEEE Trans. Comput. 57(7), 889–901 (2008)
13. Reed, I.S., Solomon, G.: Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8(2), 300–304 (1960)
14. Rodrigues, R., Liskov, B.: High availability in DHTs: erasure coding vs. replication. In: Castro, M., van Renesse, R. (eds.) IPTPS 2005. LNCS, vol. 3640, pp. 226–239. Springer, Heidelberg (2005). https://doi.org/10.1007/11558989_21
15. Luo, J., Bowers, K.D., Oprea, A., Xu, L.: Efficient software implementations of large finite fields GF(2n) for secure storage applications. ACM Trans. Storage 8(2) (2012)

Answer Aggregation of Crowdsourcing Employing an Improved EM-Based Approach

Ran Zhang(✉), Lei Liu, Lizhen Cui, Wei He, and Hui Li

Shandong University, Jinan, Shandong, China
[email protected], [email protected]

Abstract. Crowdsourcing platforms such as Amazon Mechanical Turk are frequently employed to collect answers from numerous participants on the Internet. Different participants may give different answers to the same question, which leads to unexpected aggregated answers. The accuracy of aggregated answers depends on answer quality, which varies with the skill level of participants. In crowdsourcing, participants are called workers. Existing studies usually characterize worker quality by worker skill alone. However, the personality features of individual persons, e.g., worker emotion and worker intent, may have a significant impact on answer quality, so aggregating answers without taking these personality characteristics into account may lead to unexpected results. To fill this gap, this paper employs an improved EM-based approach for answer aggregation that is based on workers' answer data and takes personality characteristics into account. The approach not only aggregates answers but also simultaneously estimates the skill level, emotion, and intent of each worker, as well as the difficulty of the task. Finally, the approach is verified on the real-world dataset Affect Text and on simulated datasets.

Keywords: Crowdsourcing · Worker skill · Task difficulty · Worker quality · Personality characteristics · EM-based approach · Answer aggregation

1 Introduction

Crowdsourcing is a distributed problem-solving approach that aids computers in completing tasks they cannot solve on their own [1]. There are many crowdsourcing platforms, e.g., Amazon Mechanical Turk, CrowdFlower, www.zbj.com, and www.weichaishi.com. A crowdsourcing platform publishes tasks from requesters, e.g., sentiment labeling tasks [17], and collects answers from workers. Hundreds of workers on such platforms can accept tasks and send back the corresponding answers. Based on the collected answers, aggregated answers can be obtained through some aggregation algorithm. The accuracy of the aggregated answers depends on answer quality, which varies with the skill level, intent, and emotion of workers. Existing works usually study the influence of skill level and worker intent on answer quality and ignore the personality characteristics of persons, e.g., worker emotion. Therefore, to obtain the aggregated answers, this paper takes worker skill, worker emotion, and worker intent into consideration; the difficulty of the task is also taken into account.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 280–290, 2018. https://doi.org/10.1007/978-3-030-05057-3_23


Personality characteristics of persons are important for modeling worker quality. Several factors influence workers' answers. Each task has its own difficulty level, which affects the judgment of workers, and because of their different characters, workers differ in skill and intent. In addition, emotion, as a personality characteristic, also affects the accuracy of workers' answers [11]. Some researchers [8] find that workers in a positive emotional state are more productive than workers in a negative one. Although many studies have considered worker quality, studies that take worker emotion into account are hard to find. In this paper, worker emotion is taken into consideration when aggregating answers from workers. The improved answer aggregation approach in this paper is based on the EM algorithm. It considers the influence of worker intent, worker emotion, worker skill, and task difficulty on workers' answers, and from it the aggregated answers are obtained. We call this EM-based method the four-parameter EM approach. In this model, workers with different emotions and different intents are formulated based on their behavior. This paper defines three types of workers: workers who are non-malicious and in a positive emotional state, workers who are non-malicious but in a negative emotional state, and malicious workers who answer only for money. A generalized Expectation-Maximization algorithm [9, 10] is used to perform the parameter estimation in this paper.
The improved EM-based approach is divided into two steps which are performed iteratively until the parameter set converges: (1) Expectation step: use the existing estimates of worker skill, worker emotion, worker intent, and task difficulty in the parametric probabilistic model to calculate the expectation of the aggregated answers; (2) Maximization step: find the worker skill, worker intent, worker emotion, and task difficulty that maximize the expected log-likelihood. Once the parameters have converged, the probability distribution of the aggregated answers is also constant, so the aggregated answers of the tasks can be obtained, together with the worker skill, worker intent, worker emotion, and task difficulty at that time. Experiments show that this method is effective. The contributions of this paper are as follows:
• This paper aggregates answers considering personality characteristics of persons, e.g., worker emotion. Few existing works take this into account, and it is challenging to evaluate. To this end, a method for evaluating personality characteristics of persons is provided, which considers the influence of worker emotion;
• This paper develops an improved EM algorithm with four parameters: worker emotion, worker intent, worker skill, and task difficulty. The method combines these four parameters to evaluate worker quality and meanwhile obtains the aggregated answers.
The rest of the paper is organized as follows. Section 2 reviews related work on answer aggregation and quality control. Section 3 describes in detail the improved EM-based method with four parameters. Section 4 presents the validation of the improved method on a simulated dataset and a real-world dataset; the real-world dataset is a sub-dataset of Affective Text. Section 5 concludes this paper.
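The expectation/maximization alternation described above is the standard EM loop. As a minimal, self-contained illustration of that structure (a toy two-coin mixture problem, not the paper's four-parameter model), the loop looks like:

```python
def em_two_coins(batches, iters=50):
    """Toy EM: estimate the biases of two coins from batches of flips
    (heads, tails) when we do not know which coin produced each batch."""
    theta_a, theta_b = 0.6, 0.5  # asymmetric initial guesses
    for _ in range(iters):
        # E-step: posterior responsibility of coin A for each batch
        counts_a = [0.0, 0.0]  # expected (heads, tails) credited to coin A
        counts_b = [0.0, 0.0]
        for heads, tails in batches:
            la = (theta_a ** heads) * ((1 - theta_a) ** tails)
            lb = (theta_b ** heads) * ((1 - theta_b) ** tails)
            ra = la / (la + lb)
            counts_a[0] += ra * heads
            counts_a[1] += ra * tails
            counts_b[0] += (1 - ra) * heads
            counts_b[1] += (1 - ra) * tails
        # M-step: re-estimate the biases from the expected counts
        theta_a = counts_a[0] / (counts_a[0] + counts_a[1])
        theta_b = counts_b[0] / (counts_b[0] + counts_b[1])
    return theta_a, theta_b
```

The paper's algorithm replaces the coin biases with the four parameters and the hidden coin identity with the latent ground-truth answers, but the iterate-until-convergence shape is the same.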

R. Zhang et al.

2 Related Work

Collecting answers from numerous workers through crowdsourcing platforms has been widely accepted. To obtain aggregated answers, an answer aggregation approach is often used. Majority voting [5, 6] is a simple approach to obtain the aggregated answer. However, majority voting assumes that all workers have the same accuracy. In fact, workers have different individual accuracies, since they differ in characteristics such as worker skill, worker intent, and worker emotion. To fill this gap, methods that consider these differences have arisen. In [3], Cao et al. use weighted majority voting to aggregate workers' answers, estimating worker accuracy from the history of workers' answers. In [7], Demartini et al. propose a probabilistic model based on a factor graph that considers workers' answers and their accuracy. Both [3] and [7] use workers' historical answers to analyze individual worker accuracy. In [13], Koulougli et al. propose an accumulative weighted voting method which takes into account both uncertainty and skill levels. In [12], Sun et al. propose a probabilistic model for the quantitative crowdsourcing problem that considers changes in worker ability so as to achieve better quality control. Both [13] and [12] consider worker skill in analyzing the accuracy of workers' answers. In addition, worker intention also affects workers' answers; for example, spammers in a crowdsourcing system may choose answers randomly for the financial reward. To distinguish spammers from non-spammers, Kurve et al. [2] add worker intention as a parameter to an EM-based algorithm for identifying malicious workers; Kurve assumes that malicious workers will choose a wrong answer when they know the correct one, or choose answers randomly otherwise. In [14], Moayedikia et al. propose a worker reliability estimation algorithm, which relies on a Gaussian process model and a bee colony algorithm, to distinguish spammers from non-spammers.
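The difference between plain majority voting and its weighted variant can be sketched as follows (a minimal illustration; function names are ours, and the accuracy weights are supplied directly rather than learned from answer history as in [3]):

```python
from collections import defaultdict

def majority_vote(answers):
    """answers: list of (worker_id, label) pairs. Plain majority voting
    implicitly assumes every worker is equally accurate."""
    tally = defaultdict(float)
    for _, label in answers:
        tally[label] += 1.0
    return max(tally, key=tally.get)

def weighted_majority_vote(answers, accuracy):
    """Weighted variant: each worker's vote is scaled by an accuracy
    estimate (here a given dict; in practice it would be inferred)."""
    tally = defaultdict(float)
    for worker, label in answers:
        tally[label] += accuracy.get(worker, 0.5)
    return max(tally, key=tally.get)
```

With unequal weights, a single reliable worker can outvote several unreliable ones, which is exactly the failure mode of plain majority voting that the weighted schemes address.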
Both [2] and [14] consider worker intention in analyzing workers' answers. In addition to the factors mentioned above, other factors also affect the accuracy of workers' answers. In [15], Wu et al. propose a novel ground-truth inference algorithm which is based on the EM algorithm and aggregates answers; it considers the reliability of each worker and the difficulty of each instance. The algorithm GLAD [16] also takes the difficulty of instances into consideration and adopts a logistic regression model for inference. Both [15] and [16] consider the influence of task difficulty on the accuracy of workers' answers. In [11], Yu et al. leverage the relationship between worker emotion and productivity to schedule workers' working time for high answer quality; this work considers workers' personality characteristics. Some researchers [8] find that workers in positive emotion are more productive than workers in negative emotion. In this paper, factors including worker skill, task difficulty, worker intent, and worker emotion are all taken into account to aggregate answers.

3 Crowdsourcing EM-Based Approach with Four Parameters

3.1 Four Parameters in the Improved EM-Based Approach
The four parameters of the algorithm are introduced here: worker skill, task difficulty, worker intent, and worker emotion. Continuous parameters describe worker skill and task difficulty: k_j ∈ (−∞, +∞) represents the skill of worker j, and d_i ∈ (−∞, +∞) represents the difficulty level of task i. Binary parameters denote worker emotion and worker intent: (m_j, w_j) ∈ {(1, 1), (0, 1), (1, 0), (0, 0)}, where m_j indicates the emotion of worker j and w_j indicates the intent of worker j. As shown in Table 1, worker types follow from the combination of worker emotion and worker intent. (1, 1) denotes workers who are in positive emotion and have non-malicious intent; they are called PN workers. (0, 1) denotes workers who are in negative emotion but have non-malicious intent; they are called NN workers. (1, 0) and (0, 0) indicate that the worker is malicious; whether the emotion is positive or negative, such workers are called MM workers. PN workers tend to answer questions correctly to the best of their ability. NN workers tend to answer as accurately as possible, but negative emotion reduces their accuracy; the influence factor q indicates the extent of the effect of emotion. MM workers tend to answer at random, regardless of their emotion.

Table 1. Types of workers

                    Worker intent
Worker emotion      Non-malicious: 1   Malicious: 0
Positive: 1         PN                 MM
Negative: 0         NN                 MM
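The mapping of Table 1 from the binary pair (m_j, w_j) to a worker type can be expressed as a small helper (the function name is ours, introduced only for illustration):

```python
def worker_type(m, w):
    """Map the binary emotion/intent pair (m_j, w_j) to the worker
    types of Table 1: PN, NN, or MM."""
    if w == 0:
        # Malicious intent dominates; emotion is irrelevant for MM workers
        return "MM"
    return "PN" if m == 1 else "NN"
```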

3.2 Expectation Step of the Improved EM-Based Approach
Suppose a group of workers answers T_n non-probe tasks. Each worker answers at least one task, and each task receives at least one answer. A probe task is a task with known ground truth. Let M_i denote the number of workers answering task i. There are T_p probe tasks that are also published to workers, but workers do not know that these are probe tasks; therefore, workers with low accuracy cannot escape detection by the system. Take {1, 2, 3, …, T_p} as the index set of probe tasks; {T_p + 1, T_p + 2, …, T_p + T_n} is the index set of non-probe tasks. The answer for task i is chosen from the set of options O_i ≡ {1, 2, 3, …, C_i}. Let z_i ∈ O_i be the ground-truth answer to task i, and let r_ij ∈ O_i be the answer for task i from worker j. Referring to the stochastic generative model for workers' answers based on worker behavior in [2], the parameter set of the four-parameter model is defined as Ω = {{(w_j, m_j, k_j, q_j) ∀j}, {d_i ∀i}}.
In order to define the probability model of the ground truth, this paper gives the probability distribution of workers' answers based on the answering behaviors of the different worker types. Based on the difference between the worker skill and the difficulty of the task, the sigmoid function is used to model the probability that the workers will answer


the tasks correctly:

$$s_{ij} = \frac{1}{1 + e^{-(k_j - d_i)}}.$$

q_j ∈ (0, 1) indicates the extent of the effect of negative emotion on the accuracy of a worker, i.e., the fraction of skill retained under negative emotion.

PN Workers
The probability mass function φ is defined as (1). It expresses the probability that the answer of the worker is r_ij when the emotion of the worker is positive and the intent is non-malicious. In this scenario, the probability of answering correctly is decided only by the difference between the worker skill and the task difficulty: the greater the value of (k_j − d_i), the greater the probability of answering correctly; as it decreases, φ tends to 1/C_i.

$$
\varphi(r_{ij} = l \mid \Omega_{ij}, (m_j, w_j) = (1, 1), z_i) =
\begin{cases}
\dfrac{1}{1 + e^{-(k_j - d_i)}} + \dfrac{1}{C_i} \cdot \dfrac{e^{-(k_j - d_i)}}{1 + e^{-(k_j - d_i)}} & \text{for } l = z_i \\[1.5ex]
\dfrac{1}{C_i} \cdot \dfrac{e^{-(k_j - d_i)}}{1 + e^{-(k_j - d_i)}} & \text{otherwise}
\end{cases}
\tag{1}
$$

NN Workers
The probability mass function φ is defined as (2). When a worker is in negative emotion but has non-malicious intent, the ability to answer correctly is partly reduced; the influence factor q_j denotes the retained fraction of that ability.

$$
\varphi(r_{ij} = l \mid \Omega_{ij}, (m_j, w_j) = (0, 1), z_i) =
\begin{cases}
\dfrac{q_j}{1 + e^{-(k_j - d_i)}} + \dfrac{1}{C_i} \cdot \dfrac{e^{-(k_j - d_i)}}{1 + e^{-(k_j - d_i)}} & \text{for } l = z_i \\[1.5ex]
\dfrac{1 - q_j}{1 + e^{-(k_j - d_i)}} \cdot \dfrac{1}{C_i - 1} + \dfrac{1}{C_i} \cdot \dfrac{e^{-(k_j - d_i)}}{1 + e^{-(k_j - d_i)}} & \text{otherwise}
\end{cases}
\tag{2}
$$

MM Workers
When a worker has malicious intent, whether he is in positive or negative emotion, he does not answer truthfully: following [2], he picks a wrong answer when he knows the correct one and answers randomly otherwise. The probability mass function φ is defined as:


$$
\varphi(r_{ij} = l \mid \Omega_{ij}, w_j = 0, z_i) =
\begin{cases}
\dfrac{1}{C_i} \cdot \dfrac{e^{-(k_j - d_i)}}{1 + e^{-(k_j - d_i)}} & \text{for } l = z_i \\[1.5ex]
\dfrac{1}{1 + e^{-(k_j - d_i)}} \cdot \dfrac{1}{C_i - 1} + \dfrac{1}{C_i} \cdot \dfrac{e^{-(k_j - d_i)}}{1 + e^{-(k_j - d_i)}} & \text{otherwise}
\end{cases}
\tag{3}
$$
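A small numerical sketch of the three worker-type likelihoods and of the E-step normalization that follows from them (the function names and the exact case structure are our reading of Eqs. (1)-(3), not the authors' code; in each case the probabilities over the C_i options sum to one):

```python
import math

def phi(l, z, k_j, d_i, q_j, m_j, w_j, C_i):
    """Answer likelihood phi(r_ij = l | ...) for one worker/task pair.
    s is the sigmoid of skill minus difficulty; q_j is read as the
    fraction of skill an NN worker retains under negative emotion."""
    s = 1.0 / (1.0 + math.exp(-(k_j - d_i)))
    guess = (1.0 - s) / C_i            # mass from "does not know the answer"
    if w_j == 0:                       # MM: wrong on purpose when known
        return guess if l == z else s / (C_i - 1) + guess
    keep = s if m_j == 1 else q_j * s  # PN keeps full skill, NN a fraction
    if l == z:
        return keep + guess
    return (s - keep) / (C_i - 1) + guess  # lost skill spread over wrong options

def posterior(answers, params, C_i):
    """E-step posterior over candidate classes: normalize the product of
    per-worker likelihoods, as in Eq. (4). answers: {worker_j: r_ij};
    params: {worker_j: (k_j, d_i, q_j, m_j, w_j)}."""
    weights = []
    for c in range(1, C_i + 1):
        p = 1.0
        for j, r in answers.items():
            k_j, d_i, q_j, m_j, w_j = params[j]
            p *= phi(r, c, k_j, d_i, q_j, m_j, w_j, C_i)
        weights.append(p)
    total = sum(weights)
    return [w / total for w in weights]
```

When several skilled non-malicious workers agree on an option, the posterior concentrates on that option, which is the behavior the generative model is meant to capture.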

Based on the generation model above and Bayes' rule, the posterior probability mass function is given by (4), which is the probability distribution of the ground truth of the non-probe tasks:

$$
P_i(Z_i = c \mid X, \Omega^t) = \frac{\prod_{j=1}^{M_i} \varphi(r_{ij} \mid \Omega_{ij}^t, z_i = c)}{\sum_{l=1}^{C_i} \prod_{j=1}^{M_i} \varphi(r_{ij} \mid \Omega_{ij}^t, z_i = l)}, \quad \forall c,\ \forall i \in \{T_p + 1, \ldots, T_p + T_n\},
\tag{4}
$$

where Ω^t is the current parameter set, c ∈ {1, 2, 3, …, C_i}, and X denotes the observed data, containing the workers' answers to every task and the ground-truth answers of the probe tasks. This paper treats the ground truths of the non-probe tasks (Z_n) as latent variables in the EM model.

3.3 Maximization Step of the Improved EM Algorithm
There are observed data and unobserved data here: the observed data are denoted as X above, and the unobserved data are the ground truths of the non-probe tasks (Z_n). X and (X, Z_n) are called the incomplete data and the complete data, respectively. To estimate the parameter set Ω, the incomplete-data log-likelihood should be maximized; this is done by iteratively maximizing the expectation of the complete-data log-likelihood Q(Ω|Ω^t). Based on the generation model above and P(Z_i = c | X, Ω^t) obtained in the expectation step, the expected complete-data log-likelihood Q(Ω|Ω^t) can be written as (5).


$$
Q(\Omega \mid \Omega^t) = E[\log L_C \mid X, Z_n, \Omega^t]
\propto \sum_{i=1}^{T_p} \sum_{j=1}^{M_i} \log \varphi(r_{ij} \mid \Omega_{ij}^t, z_i)
+ \sum_{i=T_p+1}^{T_p+T_n} \sum_{c=1}^{C_i} \sum_{j=1}^{M_i} P(z_i = c) \log \varphi(r_{ij} \mid \Omega_{ij}^t, z_i = c),
\tag{5}
$$

where each log-likelihood term expands according to the worker type:

$$
\log \varphi(r_{ij} \mid \Omega_{ij}^t, z_i) =
w_j \big[ m_j \log \varphi(r_{ij} \mid \Omega_{ij}^t, (m_j, w_j) = (1, 1), z_i)
+ (1 - m_j) \log \varphi(r_{ij} \mid \Omega_{ij}^t, (m_j, w_j) = (0, 1), z_i) \big]
+ (1 - w_j) \log \varphi(r_{ij} \mid \Omega_{ij}^t, w_j = 0, z_i).
$$

Since k_j ∀j and d_i ∀i are continuous parameters, (m_j, w_j) ∀j are discrete parameters, and there are 3^M possible crowd configurations (three worker types for each of M workers), it is infeasible to find a closed-form solution for Ω^{t+1} = arg max_Ω Q(Ω|Ω^t). Therefore, the maximization step of the improved EM algorithm is divided into two sub-steps, continuous parameter calculation and discrete parameter calculation, which are maximized iteratively.

Discrete Parameters Calculation

In this sub-step, a closed-form solution for (m_j, w_j) ∀j is found with k_j ∀j, q_j ∀j, and d_i ∀i fixed. (m̃_j, w̃_j) denotes the result for worker emotion and worker intent in the discrete parameter calculation sub-step, and Ω̃ denotes the result of the previous continuous parameter calculation sub-step. A genetic algorithm is used here.

Continuous Parameters Calculation

In this sub-step, the expectation of the complete log-likelihood is calculated to find the k_j ∀j, q_j ∀j, and d_i ∀i that maximize it, with (m_j, w_j) ∀j fixed; the values of (m_j, w_j) ∀j come from the previous discrete parameter calculation sub-step. Gradient ascent is used to find a local maximum over k_j ∀j, q_j ∀j, and d_i ∀i, and is performed until the change in likelihood between two gradient steps falls


below a certain threshold. The two sub-steps are performed iteratively until the parameters in Ω converge, and the results of the maximization step are stored in Ω^{t+1}. The expectation step and the maximization step are then performed iteratively until the change in the expected likelihood between two iterations falls below a certain threshold. The EM algorithm with four parameters is guaranteed to find a local solution.
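The stopping rule used in the gradient sub-step (iterate until the objective changes by less than a threshold) can be sketched on a toy concave objective; the function names, learning rate, and objective are ours, for illustration only:

```python
def gradient_ascent(grad, f, x0, lr=0.1, tol=1e-8):
    """Gradient ascent with the stopping rule described above: step in
    the gradient direction until the change in the objective between
    two consecutive steps falls below a threshold."""
    x = x0
    prev = f(x)
    while True:
        x = x + lr * grad(x)   # ascend the objective
        cur = f(x)
        if abs(cur - prev) < tol:
            return x
        prev = cur
```

In the paper's setting the objective would be the expected log-likelihood of (5) and x the stacked continuous parameters; like any gradient method, this only guarantees a local maximum.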

4 Validation

Validation on Simulated Dataset. Simulated data generated by the generative model above are used. First, a group of 10 workers is generated with kj ∼ N(1, 1000); 10% of the workers have malicious intent, and 10% of the workers with non-malicious intent are in negative emotion. This paper only considers the effect of negative emotion on non-malicious workers. Tasks are generated with di ∼ N(20, 500). The emotion influence factor qj ∈ (0, 1) follows a normal distribution. The ground truth for each task is chosen randomly from {1, 2, 3}. To make the results easier to observe, the estimated worker skills and task difficulties are each fitted by the least-squares method. Based on the above probability distributions of worker skill and task difficulty, reference values of worker skill and task difficulty are generated separately; these are the actual values used to generate the simulated data. The Four-parameter algorithm and the Three-parameter algorithm (the EM-based algorithm of Kurve) are then used to obtain estimated values of worker skill and task difficulty. Figure 1 compares the estimated values of worker skill with the actual values, and Fig. 2 does the same for task difficulty. The solid line denotes values estimated by the Four-parameter algorithm, the dotted line denotes values estimated by the Three-parameter algorithm, and the remaining line denotes the actual values. Figures 1 and 2 show highly consistent trends between the estimated values of worker skill and task difficulty and the actual values.
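The simulated-data setup above can be reproduced in outline as follows (the text does not state whether N(·, ·) gives a variance or a standard deviation; variance is assumed here, and the clamp keeping q_j inside (0, 1) is our addition):

```python
import random

def simulate(num_workers=10, num_tasks=50, seed=7):
    """Generate simulated crowdsourcing parameters following the setup
    above: skills k_j ~ N(1, 1000), difficulties d_i ~ N(20, 500),
    10% malicious workers, 10% of the non-malicious workers in negative
    emotion, ground truth drawn from {1, 2, 3}."""
    rng = random.Random(seed)
    workers = []
    for _ in range(num_workers):
        k = rng.gauss(1, 1000 ** 0.5)
        w = 0 if rng.random() < 0.1 else 1            # 10% malicious
        m = 0 if (w == 1 and rng.random() < 0.1) else 1  # negative emotion
        q = min(max(rng.gauss(0.5, 0.15), 0.01), 0.99)   # clamp into (0, 1)
        workers.append({"k": k, "w": w, "m": m, "q": q})
    tasks = [{"d": rng.gauss(20, 500 ** 0.5), "z": rng.choice([1, 2, 3])}
             for _ in range(num_tasks)]
    return workers, tasks
```

Feeding these parameters through the worker-type likelihoods would then yield the simulated answer matrix on which the two algorithms are compared.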

Fig. 1. The comparison of estimated values of worker skill with actual values

288

R. Zhang et al.

Fig. 2. The comparison of estimated values of task diﬃculty with actual values

For a better explanation of the two figures, two indicators are used: NSE and RMSE. NSE (Nash-Sutcliffe efficiency coefficient) measures how well the values estimated by the Four-parameter algorithm fit the actual values; as long as the NSE lies between 0 and 1, the fit of the model is credible. For worker skill, the NSE of the Four-parameter algorithm is 0.49; for task difficulty, it is 0.58. This indicates that the Four-parameter algorithm is credible for estimating task difficulty and worker skill. As shown in Table 2, the RMSE (root-mean-square error) of the Four-parameter algorithm for worker skill is 7.04, while that of the Three-parameter algorithm is 11.33, which indicates a good fit between the worker skills estimated by the Four-parameter algorithm and the actual values. For task difficulty, the RMSE of the Four-parameter algorithm is 10.80 and that of the Three-parameter algorithm is 11.40, which likewise indicates a good fit between the estimated task difficulties and the actual values.

Table 2. RMSE of estimated values and actual values

Parameters                | Four-parameter algorithm | Three-parameter algorithm
Worker skill (Fig. 1)     | 7.04                     | 11.33
Task difficulty (Fig. 2)  | 10.80                    | 11.40
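The two indicators follow directly from their definitions and can be computed as:

```python
import math

def rmse(est, act):
    """Root-mean-square error between estimated and actual values."""
    n = len(est)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(est, act)) / n)

def nse(est, act):
    """Nash-Sutcliffe efficiency: 1 - SSE / total variance of the
    actual values. Values between 0 and 1 indicate a credible fit;
    1 is a perfect fit."""
    mean = sum(act) / len(act)
    sse = sum((e - a) ** 2 for e, a in zip(est, act))
    var = sum((a - mean) ** 2 for a in act)
    return 1.0 - sse / var
```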

Accuracy measures how well the algorithm estimates the aggregated answers. The accuracy of the Four-parameter algorithm is 0.86, which is as good as that of the Three-parameter algorithm; the Four-parameter algorithm can therefore obtain accurate aggregated answers. The experimental results show that, in the setting of this paper, the Four-parameter algorithm is no worse than the Three-parameter algorithm at estimating worker skill and task difficulty as well as at aggregating answers.
Validation on Real-World Dataset. The Affective Text dataset is collected from the sentiment labeling task proposed by Strapparava et al. [17]. They employ workers to rate the title of a piece of

Answer Aggregation of Crowdsourcing Employing

289

news for a few types of emotions, plus a comprehensive score (Valence) indicating the overall emotion of the news. Snow et al. [18] select a set of 100 samples from the SemEval set and obtain 1000 scores for each emotion as well as Valence scores. For each emotion, workers provide a score in the range [0, 100]; for Valence, workers provide a score in the range [−100, 100]. This paper maps the Valence score to two, three, four, and five classes respectively, obtaining four sub-datasets: the Two-classes, Three-classes, Four-classes, and Five-classes datasets. Figure 3 shows the comparison between the developed method and ZenCrowd [7], KOS [4], and the baseline Majority Voting (MV) on the Valence sub-dataset of the Affective Text dataset; as mentioned, this sub-dataset is divided into four sub-datasets by number of classes. In Fig. 3, the Four-parameter algorithm performs well on three, four, and five classes. The experimental comparison shows that the Four-parameter algorithm performs well on multi-category sentiment labeling tasks.
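The paper does not spell out how the Valence scores are binned into k classes; an equal-width mapping over [−100, 100] is one plausible reading, sketched here with a function name of our choosing:

```python
def to_classes(valence, k):
    """Map a Valence score in [-100, 100] into one of k equal-width
    classes (0 .. k-1). The right boundary +100 is clamped into the
    last class."""
    # shift the score to [0, 200], then bin by width 200 / k
    idx = int((valence + 100) * k / 200)
    return min(idx, k - 1)
```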

Fig. 3. The comparison of accuracy of the aggregated answers between MV, ZenCrowd, KOS and OurAlgo.

5 Conclusions

Answer aggregation considering worker quality is a useful tool for aggregating workers' answers. The aggregated answers depend on the quality of the answers, which varies with worker skill, worker intent, and worker emotion. In this paper, an improved EM-based answer aggregation method is used to obtain the aggregated answers. The improved method not only obtains aggregated answers considering worker quality but also simultaneously estimates the worker skill, worker emotion, worker intent, and the difficulty of each task. By taking into account the factors that affect workers' answers, a more accurate analysis of the probability that a worker answers correctly is achieved, and thus aggregated answers that account for worker quality are obtained. Verification is performed on a simulated dataset and a real-world dataset. Compared with other methods, the improved method is effective on multi-category sentiment labeling tasks and obtains more accurate results in that scenario.


Acknowledgment. This work is partially supported by National Key R&D Program No. 2017YFB1400100, SDNFSC No. ZR2018MF014.

References
1. Feng, J.H., Li, G.L., Feng, J.H.: A survey on crowdsourcing. Chin. J. Comput. 38(9), 1713–1726 (2015)
2. Kurve, A., Miller, D., Kesidis, G.: Multicategory crowdsourcing accounting for variable task difficulty, worker skill, and worker intention. IEEE Trans. Knowl. Data Eng. 27(3), 794–809 (2014)
3. Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask? Proc. VLDB Endow. 5(11), 1495–1506 (2012)
4. Karger, D.R., Oh, S., Shah, D.: Iterative learning for reliable crowdsourcing systems. In: Advances in Neural Information Processing Systems (2011)
5. Lee, J., Cho, H., Park, J.W., Cha, Y.R., Hwang, S.W., Nie, Z., Wen, J.R.: Hybrid entity clustering using crowds and data. VLDB J. 22(5), 711–726 (2013)
6. Park, H., Garcia-Molina, H., Pang, R., Polyzotis, N., Parameswaran, A., Widom, J.: Deco: a system for declarative crowdsourcing. Proc. VLDB Endow. 5(12), 1990–1993 (2012)
7. Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: International Conference on World Wide Web, pp. 469–478. ACM (2012)
8. Oswald, A., Proto, E., Sgroi, D.: Happiness and productivity. Soc. Sci. Electron. Publ. 33(4), 789–822 (2008)
9. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
10. Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., et al.: Learning from crowds. J. Mach. Learn. Res. 11, 1297–1322 (2010)
11. Yu, H., Shen, Z.J., Fauvel, S., Cui, L.Z.: Efficient scheduling in crowdsourcing based on workers' emotion. In: IEEE International Conference on Agents, pp. 121–126. IEEE Computer Society (2017)
12. Sun, H., Hu, K., Fang, Y., Song, Y.: Adaptive result inference for collecting quantitative data with crowdsourcing. IEEE Internet Things J. 4(5), 1389–1398 (2017)
13. Koulougli, D., Hadjali, A., Rassoul, I.: Leveraging human factors to enhance query answering in crowdsourcing systems. In: IEEE Tenth International Conference on Research Challenges in Information Science, pp. 1–6. IEEE (2016)
14. Moayedikia, A., Ong, K.L., Boo, Y.L., Yeoh, W.: Bee colony based worker reliability estimation algorithm in microtask crowdsourcing. In: IEEE International Conference on Machine Learning and Applications, pp. 713–717. IEEE (2017)
15. Wu, M., Li, Q., Zhang, J., Cui, S., Li, D., Qi, Y.: A robust inference algorithm for crowdsourced categorization. In: International Conference on Intelligent Systems and Knowledge Engineering, pp. 1–6 (2017)
16. Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., Movellan, J.: Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: International Conference on Neural Information Processing Systems, pp. 2035–2043. Curran Associates Inc. (2009)
17. Strapparava, C., Mihalcea, R.: SemEval-2007 task 14: affective text. In: International Workshop on Semantic Evaluations, pp. 70–74. Association for Computational Linguistics (2007)
18. Snow, R., O'Connor, B., Jurafsky, D., Ng, A.Y.: Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In: Conference on Empirical Methods in Natural Language Processing (2008)

Internet of Things and Cloud Computing

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data Using Apache Spark in Cloud

Cheng Yang¹, Weidong Bao¹, Xiaomin Zhu¹,²(B), Ji Wang¹, and Wenhua Xiao¹,³

¹ National University of Defense Technology, Changsha, China
[email protected]
² State Key Laboratory of High Performance Computing, Changsha, China
³ Academy of Military Sciences, Beijing, China
Abstract. In the field of signal processing, the Fast Fourier Transform (FFT) is a widely used algorithm to transform signal data from the time domain to the frequency domain. Unfortunately, with the exponential growth of data, traditional methods cannot meet the demand of large-scale computation on such big data because of three main challenges of large-scale FFT: big data size, real-time data processing, and high utilization of compute resources. To satisfy these requirements, an optimized FFT algorithm in the Cloud is sorely needed. In this paper, we introduce a new method to conduct FFT in the Cloud with the following contributions: first, we design a parallel FFT algorithm for large-scale signal data in the Cloud; second, we propose a MapReduce-based mechanism to distribute data to compute nodes using a big data processing framework; third, an optimized method of distributing compute resources is implemented to accelerate the algorithm by avoiding redundant data exchange between compute nodes. The algorithm is designed in the MapReduce computation framework and contains three steps: data preprocessing, local data transform, and parallel data transform to integrate the processing results. The parallel FFT is implemented in a 16-node Cloud to process real signal data. The experimental results reveal an obvious improvement in algorithm speed: our parallel FFT is approximately five times faster than FFT in Matlab when the data size reaches 10 GB.

Keywords: Fast Fourier Transform · Cloud computing · Apache Spark · Parallel algorithm

1 Introduction

Target detection usually employs traditional methods such as radar detection to detect aerial targets [14,28]. However, these methods are not available when the signal from aerial aircraft is weak. Fortunately, utilizing spatial electric signals from satellites to detect targets is a feasible developing approach to

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 293–310, 2018.
https://doi.org/10.1007/978-3-030-05057-3_24

294

C. Yang et al.

detect aerial targets [10,13]. Since aircraft reflect the signal from satellites, the ground receiving station gets two different signals: the pure signal arriving directly from the satellites, and the reflected signal bounced off the aircraft. By analyzing and comparing the pure signal and the reflected signal, position information about aerial targets can be obtained. It should be noted that in the process of comparison, a huge quantity of data (3 TB) needs to be processed per hour in real time, which requires the back-end data processing systems to be capable of performing computations on large-scale signal data in time. Specifically, in signal comparison, numerous data need to be processed, which generates tremendous intermediate data at the same time. In this process, the Fourier transform plays a significant and indispensable role [18]. The Fourier transform decomposes a function of time into its frequency components [15,26]. The Discrete Fourier Transform, as one algorithm in the series of Fourier transforms, is widely used to detect the features of received signals; from these features, the target's information can be obtained. However, the Discrete Fourier Transform involves a great amount of calculation, which results in low efficiency. The Fast Fourier Transform (FFT) algorithm, proposed by Cooley and Tukey, simplifies and accelerates the Discrete Fourier Transform effectively [6]. It successfully reduces the complexity of the Discrete Fourier Transform from O(N·N) to O(N·log N). Although the Fast Fourier Transform is more efficient than the Discrete Fourier Transform, when the data scale becomes giant, these conventional algorithms cannot solve the signal processing problem effectively. The FFT algorithm is not only used for signal processing but is also applied to many other fields, e.g., image processing [20,22], spectral analysis [23], and data compression [12]. Improving the efficiency of the FFT algorithm on big data can benefit many research fields.
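The complexity reduction comes from the even/odd split of the Cooley-Tukey recursion. A minimal recursive radix-2 sketch (pure Python, illustrative only, not the paper's implementation):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    Splitting the input into even- and odd-indexed halves turns the
    O(N*N) DFT into O(N log N)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each level of the recursion touches all N samples once, and there are log N levels, which is where the N·log N cost comes from.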
Due to the importance of processing such large-scale data, a wide variety of approaches have been designed to optimize the performance of signal processing [31]. Among these methods, parallel Fast Fourier Transform is a unique approach since it enables the algorithm to be implemented on multiple machines. Furthermore, with the fast improvement of Cloud computing technology [1,2], parallel FFT can be implemented in big data processing frameworks. To the best of our knowledge, there is little work on conducting parallel FFT with big data processing frameworks in the Cloud. We use Apache Spark to optimize the real-time FFT job. Apache Spark is an efficient parallel big data processing framework [21,30]. It derives from the conventional Cloud computing framework MapReduce, which repeatedly reads and writes data from an external stable storage system [32]; when an application needs to frequently reuse intermediate data, MapReduce becomes inefficient. Apache Spark presents a new strategy to avoid such futile read and write operations on disk: it introduces the Resilient Distributed Dataset (RDD), a distributed memory abstraction that enables data to be stored in memory. In this way, the speed of iterative computation is greatly improved. There is a close correlation between the Fast Fourier Transform and Apache Spark. Iteration and parallelization are two main properties of FFT, which makes Apache Spark suitable for it. First, as FFT intensively generates and reuses


the intermediate data, immense read and write operations are unavoidable in the conventional method. To solve this problem, Apache Spark stores intermediate data in memory so that it performs such iterative computation efficiently [29]. Second, inside each step of FFT, the Discrete Fourier Transform conducts computation separately on the data, which makes it feasible to parallelize FFT on Apache Spark. Simply using Apache Spark to implement a parallel FFT algorithm is not sufficient for big data processing; the underlying computing system also needs to be suitable. In order to improve the utilization of resources, a strategy to optimally allocate compute resources to each node is proposed. We design two resource allocation strategies for parallel FFT. The equally-split strategy provides a simple method to make full use of compute resources. The optimized-split strategy improves efficiency by reducing the data exchange between compute nodes, which further improves resource utilization. The major contributions of this paper are as follows:
– A MapReduce-based mechanism to efficiently distribute signal data to compute nodes. The MapReduce process contains three steps: data preprocessing (map data to compute nodes), local data transform, and parallel data transform to integrate the processing results (collect results).
– A parallel approach to implement the Fast Fourier Transform based on Apache Spark. The parallel approach provides an effective method to utilize more compute resources.
– Optimized strategies to allocate compute resources in the Cloud for high-speed parallel FFT. During parallel FFT, there are many redundant data exchanges between compute nodes; our allocation strategies reduce these exchanges.
The remainder of the paper is organized as follows. The next section reviews related work in the literature. Section 3 formally describes the system model of the computation Cloud we designed.
This is followed by Sect. 4, which presents the framework of the parallel Fast Fourier Transform algorithm. The allocation strategy for computation resources is given in Sect. 5. Section 6 presents the performance evaluation of the algorithm. Section 7 concludes the paper with a summary and future work.
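The three MapReduce steps named in the contributions (map data to nodes, local transform, integrate results) can be illustrated serially, without Spark. The sketch below uses the standard decimation-in-time identity to combine p partial spectra; mapping partitioning onto RDDs, the local transforms onto `mapPartitions`, and the combination onto a final aggregation is our assumption about how this would sit on Spark, not a description of the authors' code:

```python
import cmath

def local_dft(x):
    """Naive DFT standing in for the per-node 'local data transform'."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def parallel_fft(x, p):
    """Three-step decomposition, run serially for illustration:
    (1) preprocess: map samples to p partitions by index mod p;
    (2) local transform: DFT each partition independently;
    (3) integrate: combine the partial spectra with twiddle factors,
        X[k] = sum_r e^{-2*pi*i*r*k/N} * Y_r[k mod (N/p)]."""
    n = len(x)
    parts = [x[r::p] for r in range(p)]             # step 1
    partial = [local_dft(part) for part in parts]   # step 2
    m = n // p
    out = []                                        # step 3
    for k in range(n):
        acc = 0j
        for r in range(p):
            acc += cmath.exp(-2j * cmath.pi * r * k / n) * partial[r][k % m]
        out.append(acc)
    return out
```

Step 2 is embarrassingly parallel, which is what makes the approach attractive on a cluster; the cost of step 3 is where the data-exchange optimizations of Sect. 5 come in.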

2 Related Work

Since Cooley and Tukey [6] first introduced the Fast Fourier Transform, FFT has had a substantial influence on the area of signal processing. The FFT algorithm provides an efficient method for Fourier analysis to produce spectrograms. Unfortunately, with the exponential growth of data, the original FFT algorithm gradually could not meet the computation demand. Therefore, many approaches have been proposed to improve the speed of FFT; interest arose both in finding efficient implementations of FFT and in improving the algorithm itself. On the one hand, a variety of studies focused on faster algorithms that improve the inner computation process of FFT. Preuss [19] proposed a radix-2 Fast

C. Yang et al.

Table 1. Summary of the main notation used throughout the paper

Symbol     Description
N          Total number of signal points in the input data
p          Number of compute nodes
D_{k,l}    The l-th data set in stage k
C_k        The k-th compute node in the Cloud
x_k        Input data array of a Fourier Transform
E_k        Even-numbered part of the input data array
O_k        Odd-numbered part of the input data array
X_k        Result data array of a Fourier Transform
e          Base of the natural logarithm
T_k        Time of the k-th stage of the algorithm
T_total    Total time of the algorithm
n          Number of CPU cores in the Cloud
m          Cache size (GB) of the Cloud
Fourier Transform algorithm which reduces the number of multiplications to two-thirds of the effort required by most radix-2 algorithms. Frigo et al. [8] proposed an FFT program that automatically tunes the computation for the particular hardware and performs significantly better than other software. Mullin [16] employed monolithic array analysis as a way to remove the constraints imposed on performance by a machine's underlying hardware to accelerate the FFT algorithm. In our study, we choose to parallelize the radix-2 Fast Fourier Transform algorithm, which is widely used in most signal-processing areas.

On the other hand, many approaches have been proposed to parallelize the computation of the FFT algorithm. Githens et al. proposed a framework called Parallel Element Processing Ensemble to conduct signal processing [9]. Based on this framework, Bergland introduced a parallel implementation of the Fast Fourier Transform that segments the algorithm into groups of identical parallel operations [4]. Wold devised a method to implement parallel FFT in VLSI [27]. Since the introduction of Hadoop [3], many efforts have been made to process data on the Cloud Computing architecture. Hassen et al. [11] distributed FFT feature extraction techniques using the MapReduce programming model in a Cloud Computing environment. Vincke et al. [24] summarized several parallel software design patterns for calculating the Fast Fourier Transform, such as MapReduce, Divide-and-Conquer, and Fork/Join. Besides FFT, a variety of research has been conducted to improve the performance of Cloud computing. Dean et al. [7] introduced the MapReduce programming model to separate large-scale data into partitions and parallelize the computation across large-scale Clouds. Wang et al. [25] proposed a system to combine long-running VM services with typical batch workloads like MapReduce.

A Parallel Fast Fourier Transform Algorithm for Large-Scale Signal Data


Fig. 1. The process of parallel FFT in Cloud

Palanisamy [17] proposed a MapReduce resource allocation system aimed at enhancing the performance of MapReduce jobs in the Cloud. Zaharia et al. [29] designed the resilient distributed datasets and Apache Spark, which uses cache memory to conduct computations. Zadeh et al. [5] proposed feasible matrix computation methods on Apache Spark. Recently, Apache Spark has been used in many application fields such as machine learning and graph computation. To the best of our knowledge, there is little work on conducting parallel FFT with Spark in the Cloud.

3 System Model

In this section, we introduce the strategies, algorithms, and terminology used in this paper. For reference, the main notation is summarized in Table 1.

3.1 Data Processing Model

We consider the data processing model as follows. To conduct FFT in parallel, the data need to be processed in the MapReduce model. Consider signal data with N sampling points, where each point carries 16 bits of data. The data are divided into p data sets D_{0,0}, D_{0,1}, ..., D_{0,p−1}. These data sets are mapped to a compute Cloud formed by p compute nodes C_1, C_2, ..., C_p. The data sets are processed by the butterfly algorithm stage by stage: in the l-th stage, each even-numbered data set D_{l,2k} combines with the odd-numbered data set D_{l,2k+1} in a Fourier Transform step to produce D_{l+1,k}. Finally, the two last data sets combine into the final result D_{log2(p),0}. Inside a compute node, the data are stored in an array X_k, separated into an even-numbered part E_k and an odd-numbered part O_k. In order to test the effectiveness, we let T_k denote the time consumed in the k-th stage and T_total = Σ_k T_k the total time of the algorithm. Figure 1 shows the process of the parallel FFT based on Apache Spark in a Cloud environment.
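To make the stage-wise pairing concrete, here is a minimal Python sketch (ours, for illustration only, not part of the paper's implementation) that enumerates which data sets D_{l,2k} and D_{l,2k+1} merge into D_{l+1,k} at each stage:

```python
def combine_schedule(p):
    """For p initial data sets D_{0,0..p-1}, list per stage the triples
    ((l, 2k), (l, 2k+1), (l+1, k)): the two sets merged and their result."""
    stages = []
    count, stage = p, 0
    while count > 1:
        stages.append([((stage, 2 * k), (stage, 2 * k + 1), (stage + 1, k))
                       for k in range(count // 2)])
        count //= 2
        stage += 1
    return stages

# With p = 4 data sets, two merge stages reduce the four sets to one result.
```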

3.2 Compute Resource Allocation Model for a Parallel FFT Job in the Cloud

Consider a compute Cloud with n CPU cores and m GB of cache. We propose two strategies to split these compute resources. In the equally-split strategy, the compute Cloud consists of p compute nodes and an extra master node that manages them. Since the compute resources are equally distributed, each compute node has n/p cores and m/p GB of cache. All of the compute nodes participate in the data processing from beginning to end. In the optimized-split strategy, the compute resources are distributed into compute nodes of different sizes. In order to execute different stages of FFT, the compute Cloud is separated into several sections s_1, s_2, ..., s_n, where different sections conduct different stages of FFT. Inside each section s_i, the resource is equally divided among r_i compute nodes C_{i,k}. The size of the compute nodes varies across sections. To increase the efficiency of data processing, we search for the optimal proportion of each section by defining the size θ_k of each section and the size ω_k of each node within it. The data processing methods differ between the two strategies, as discussed in later parts.

4 Framework of Parallel FFT Algorithms

In this section, we discuss the framework of the parallel FFT used in this paper.

4.1 Parallel FFT Algorithms in a Distributed Compute Cloud

The overall methodology of the parallel Fast Fourier Transform algorithm is to break down the input data and distribute the small data sets to compute nodes in a Cloud. Each compute node then executes the FFT algorithm, first independently and then collaboratively. At last, the results are collected by the master node of the Cloud. During the whole process, Apache Spark takes the role of distributing the input data and collecting the results. This big data processing framework provides an efficient approach to storing and computing data in the form of RDDs. Each RDD is mapped to a compute node to conduct the FFT computation. The compute nodes are organized in two ways according to the resource allocation strategies, which are discussed in Sect. 5. After the FFT computation, the results are collected by the master node.

4.2 Fast Fourier Transform Algorithm

Fast Fourier Transform is a widely used numerical algorithm in the signal processing field. It re-expresses the Discrete Fourier Transform of an arbitrary composite size N = N_1 N_2 in terms of N_1 smaller DFTs of sizes N_2,


Fig. 2. Data Preprocessing: Bit reverse

recursively, to reduce the computation time to O(N log N) for highly composite N. The Discrete Fourier Transform is expressed as follows:

X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N},    (1)

where x_n is a time signal array with period N and k = 0, ..., N − 1. Among all the Fourier Transform algorithms, the radix-2 Cooley-Tukey algorithm is the most popular. Using a divide-and-conquer approach, the time of the DFT is largely shortened. FFT separates the vector into even- and odd-numbered parts, reducing the length from N to N/2, and recursively applies this splitting to obtain smaller data sets. After the small data sets are generated, FFT combines them and calculates the results:

X_k = \sum_{m=0}^{N/2-1} x_{2m} e^{-i 2\pi k \frac{2m}{N}} + \sum_{m=0}^{N/2-1} x_{2m+1} e^{-i 2\pi k \frac{2m+1}{N}}.    (2)

The formula above consists of two summations: the left summation contains the even-numbered part of the original formula and the right contains the odd-numbered part. By defining a twiddle factor W_N^k = e^{-i 2\pi k \frac{1}{N}}, the former formula implies:

X_k = \sum_{m=0}^{N/2-1} x_{2m} W_N^{2km} + W_N^k \sum_{m=0}^{N/2-1} x_{2m+1} W_N^{2km}.    (3)

Further, it can be found that the twiddle factor satisfies W_N^{2km} = W_{N/2}^{km}. The equation can be simplified to:

X_k = \sum_{m=0}^{N/2-1} x_{2m} W_{N/2}^{km} + W_N^k \sum_{m=0}^{N/2-1} x_{2m+1} W_{N/2}^{km}.    (4)

Let E_k be the even part of the vector and O_k be the odd part. The N/2-point DFT outputs can be written as:

E_k = \sum_{m=0}^{N/2-1} x_{2m} W_{N/2}^{km},    O_k = \sum_{m=0}^{N/2-1} x_{2m+1} W_{N/2}^{km}.    (5)


Fig. 3. 8-point butterfly diagram

Consequently, the complete DFT can be expressed as:

X_k = E_k + W_N^k O_k                       for 0 ≤ k < N/2,
X_k = E_{k−N/2} − W_N^{k−N/2} O_{k−N/2}     for N/2 ≤ k < N.    (6)
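The recursion behind Eq. (6) can be illustrated with a short, self-contained radix-2 FFT in Python (a sketch for exposition, not the paper's Spark implementation; it assumes the input length is a power of two):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    E = fft(x[0::2])  # N/2-point DFT of the even-indexed part (E_k)
    O = fft(x[1::2])  # N/2-point DFT of the odd-indexed part (O_k)
    X = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        X[k] = E[k] + w * O[k]                 # Eq. (6), 0 <= k < N/2
        X[k + n // 2] = E[k] - w * O[k]        # Eq. (6), N/2 <= k < N
    return X
```

The two assignments inside the loop are exactly the two branches of Eq. (6).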

By using the divide-and-conquer concept, FFT reduces the complexity of the algorithm from O(N^2) to O(N log2 N). Rather than computing over the complete data, it is easier to compute a number of smaller data sets, so the number of Fourier Transform calculations that need to be executed decreases dramatically. Before the decompose operation, the initial data need to be rearranged in bit-reverse order, as shown in Fig. 2. Next, the rearranged data are combined so that the DFT can be calculated. The DFT process and the combine process are represented in a so-called "butterfly diagram", as illustrated in Fig. 3. In the first stage, a pair of data sets forms the input of the first DFT calculation. Then, the output data sets of the first stage become the input of the DFT calculation in the second stage. As this process is repeated, data sets combine together and become larger ones. Since each new data set is formed by two smaller data sets, the number of calculation stages is log2 N.

4.3 Parallel FFT Algorithm

Although the FFT algorithm effectively decreases the amount of calculation in the DFT, when the data size becomes immensely large and data processing faces real-time demands, FFT on a single compute device cannot fulfill the requirements in practice. Fortunately, we can parallelize the algorithm on a compute Cloud to further accelerate the computing process. Our parallel Fast Fourier Transform algorithm consists of three steps: data pre-processing, individual butterfly computation, and collaborative butterfly computation.


The first step preprocesses the data for later computation. In this step, the data are rearranged in bit-reverse order and divided into p blocks so that N/p data items can be separately stored into RDDs (Resilient Distributed Datasets, the data structure in Spark), where N is the number of sampling points in the signal data and p is the number of compute nodes. The second step executes the butterfly computation within each compute node. In the first log2(N/p) data processing stages, no data exchange between compute nodes is required, so after the data are rearranged in bit-reverse order and stored in each compute node, an N/p-point FFT is performed separately. However, in the remaining log2(p) stages of FFT, data exchange is necessary because the data length is larger than N/p. In the last step, the compute nodes cooperate to calculate the result.

Data Preprocessing. As shown in Fig. 2, before the calculation is performed, the data need to be rearranged in bit-reverse sequence. The bit-reverse procedure is shown in Algorithm 1. This job is performed on the master node of the Cloud. Then, the reordered data are sequentially separated into p data sets, which are stored in Resilient Distributed Datasets in Apache Spark and sent to the compute nodes to complete the rest of the calculation.

Algorithm 1. Preprocessing
Require: a = (a_0, a_1, ..., a_{n−1})
Ensure: b = (b_0, b_1, ..., b_{n−1})
b = bitReverse(a)
for i ← 0 to p − 1 do
    for k ← 0 to N/p − 1 do
        P[i].c[k] ← b[i · N/p + k]
    end for
end for
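The bitReverse step used by Algorithm 1 can be sketched in Python as follows (an illustrative sketch of ours; the function name mirrors the pseudocode):

```python
def bit_reverse(a):
    """Return a copy of a (length 2^t) with element k moved to the
    position given by reversing the t bits of its index."""
    n = len(a)
    t = n.bit_length() - 1                             # number of index bits
    b = [None] * n
    for k in range(n):
        r = int(format(k, '0{}b'.format(t))[::-1], 2)  # reversed t-bit index
        b[r] = a[k]
    return b

# For 8 points this yields the ordering shown in Fig. 2:
# bit_reverse([0, 1, 2, 3, 4, 5, 6, 7]) -> [0, 4, 2, 6, 1, 5, 3, 7]
```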

Local FFT Inside Each Compute Node. Once its N/p data items are received, each compute node executes an N/p-point FFT on its local data set. Since no data exchange between compute nodes is needed at this point, each compute node performs the original FFT on its local data, as shown in Algorithm 2.

FFT with Data Exchange. After the first log2(N/p) local stages, the data sets need to be combined to complete the remaining calculations, so data exchange is required. The computation proceeds from the log2(N/p)-th stage to the (log2 N − 1)-th stage, where the compute nodes need to communicate. The procedure is shown in Algorithm 3.


Algorithm 2. Local N/p-point FFT on each compute node
Require: c = (c_0, c_1, ..., c_{N/p−1})
Ensure: c = (c_0, c_1, ..., c_{N/p−1})
for i ← 0 to p − 1 do
    for k ← 0 to N/p − 1 do
        if ((i · N/p + k) mod l = (i · N/p + k) mod 2l) then
            c[k] = c[k] + c[k + l] · z^m
            c[k + l] = c[k] − c[k + l] · z^m
        end if
    end for
end for
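Algorithm 2 leaves the stage loop (over the butterfly half-width l and the twiddle exponents z^m) implicit. An executable single-node version of this local transform can be sketched in Python (our sketch; it assumes the local block is already in bit-reversed order, as produced by Algorithm 1):

```python
import cmath

def local_fft_inplace(c):
    """Iterative in-place radix-2 FFT over one bit-reversed local block."""
    n = len(c)
    l = 1                                        # butterfly half-width
    while l < n:
        z = cmath.exp(-2j * cmath.pi / (2 * l))  # this stage's principal root
        for start in range(0, n, 2 * l):
            for k in range(l):
                w = z ** k                       # z^m in Algorithm 2's notation
                a, b = c[start + k], c[start + k + l]
                c[start + k] = a + w * b
                c[start + k + l] = a - w * b
        l *= 2
    return c
```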

Algorithm 3. FFT with data exchange
Require: c = (c_0, c_1, ..., c_{N/p−1})
Ensure: c = (c_0, c_1, ..., c_{N/p−1})
j = log2(p) + 1
for e ← 0 to log2(p) − 1 do
    t = 2^e, l = 2^(e + log2(N/p)), q = N/(2l), z = w^q
    j = j − 1, v = 2^j
    for i ← 0 to p − 1 do
        if (i mod t = i mod 2t) then
            Receive a data block from the (i + p/v)-th compute node and store it into c[N/v] .. c[N/v + N/p − 1]
            for k ← 0 to N/p − 1 do
                m = (i · N/p + k) mod l
                c[k] = c[k] + c[k + N/v] · z^m
                c[k + N/v] = c[k] − c[k + N/v] · z^m
            end for
            Send the transformed data in c[N/v] .. c[N/v + N/p − 1] back to the (i + p/v)-th compute node
        else
            Send the data of this compute node to the (i − p/v)-th compute node
            After the transformation, receive the data from the (i − p/v)-th compute node and store them into c
        end if
    end for
end for


Fig. 4. Data exchange in a 16-compute-node Cloud

As shown in Algorithm 3, the N/p-pair butterfly computation is performed on one compute node, while the paired compute node sends its whole N/p data items to the corresponding node and waits until the transformed data return. As for communication overhead, 2N/p data items are exchanged per pair at every stage: the otherwise idle compute node needs one transfer of N/p items for sending and one for receiving.
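From these figures, the total communication volume of the exchange phase can be estimated with a small helper (our back-of-the-envelope sketch: log2(p) exchange stages, p/2 node pairs per stage, 2·N/p items moved per pair):

```python
def exchange_volume(N, p):
    """Estimated data items exchanged in the collaborative phase:
    log2(p) stages, each moving 2*(N/p) items across p/2 node pairs."""
    stages = p.bit_length() - 1          # log2(p) for p a power of two
    per_stage = (p // 2) * (2 * N // p)  # = N items in total per stage
    return stages * per_stage

# e.g. N = 1024 points on p = 8 nodes -> 3 stages * 1024 items = 3072
```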

5 Compute Resource Allocation Strategy

As discussed above, the parallel FFT performs the butterfly calculation on a compute Cloud. The input data are mapped to the compute nodes and the results are finally sent to the master node. During this process, the computing speed is determined by the performance of the compute Cloud. In our experiments on large amounts of signal data, we found that data exchange incurs a great amount of I/O time between compute nodes when the data size becomes large, as shown in Fig. 4. This is because in the last log2(p) stages, the size of the data to be calculated is larger than the size of the local data in each compute node. As a result, the performance of the parallel FFT algorithm is severely limited. In order to increase the speed and make full use of the compute resources in a Cloud, we propose two strategies to allocate them: the equally-split strategy and the optimized-split strategy. In a compute Cloud, the total compute resource is fixed. We assume that the compute resources are fully used by the compute Cloud because


in this way the compute process can be more efficient. We designed two strategies to allocate the limited compute resources: one equally splits the total resources so that the capability of each compute node is equal, while the other allocates unequal resources to the compute nodes. These two methods have their own pros and cons, which are discussed in later sections. Generally, computation ability is determined by the number of CPU cores, the size of the cache, and so on. Since the Fast Fourier Transform mainly uses the CPU to process data, we consider the number of CPU cores in each compute node as the main factor in its computation ability. In a given compute Cloud, because the total number of CPU cores is fixed, when the number of CPU cores in each compute node increases, the number of compute nodes decreases.

5.1 Equally-Split Strategy

We assume that the total compute resource is limited to n CPU cores and m GB of cache. The equally-split strategy splits these resources into p pieces (p should be 2^t, where t is an integer). Each compute node has n/p cores and m/p GB of cache. The input data are also equally split into p pieces and distributed to the compute nodes. After each compute node completes the calculation on its local data, data exchange between compute nodes is required to finish the rest of the computing work. Assume the input data size is N. There are log2(N) steps in a whole butterfly computing process, and between every two steps the data need to be exchanged once. Hence, the total size of the data to be exchanged is N · (log2(N) − 1).

5.2 Optimized-Split Strategy

The core idea of the optimized-split strategy is to make the data flow as a stream through the Cloud, which avoids data exchange. Although the equally-split strategy is a simple way to conduct the parallel FFT algorithm, its many data exchanges result in low speed. To better use the compute resources, we design the optimized-split strategy to redistribute them. As in the equally-split strategy, we set the total compute resources of the Cloud as n CPU cores and m GB of cache. In our experiments, we found that the CPU and the cache both have an important impact on the FFT algorithm. Therefore, we bind 1 core and 2 GB of cache together as a computing unit; every compute node holds an integer number of computing units. In order to execute different stages of FFT, the compute Cloud is separated into sections s_1, s_2, ..., s_n, where different sections conduct different stages of FFT. Inside each section s_i, the resource is equally divided among compute nodes C_{i,k} to complete the parallel calculations. Because the workload of each stage varies, we set a different size θ_k for each section and a different size ω_k for its nodes.


For example, suppose there are 48 CPU cores and 96 GB of cache in a compute Cloud. These resources can be divided into 3 sections s_1, s_2, s_3. The first section s_1 has 16 compute nodes C_{1,1}, C_{1,2}, ..., C_{1,16}, each with 1 core and 2 GB of cache. The second section s_2 has 4 compute nodes C_{2,1}, ..., C_{2,4}, each with 4 cores and 8 GB of cache. The third section s_3 has 1 compute node C_{3,1} with 16 cores and 32 GB of cache. When data come to this Cloud, they are divided into 16 parts D_{1,1}, ..., D_{1,16} and sent to the compute nodes C_{1,1}, ..., C_{1,16} of the first section. These nodes conduct FFT on their local data and send the results to the compute nodes C_{2,1}, ..., C_{2,4} of the second section, which execute the later stages of FFT on their local data and send the results to C_{3,1}. C_{3,1} completes the remaining computations and obtains the final result. It should be mentioned that the data come as a stream, so the data stream constantly flows through the Cloud from section 1 to section n and no resource in our system sits idle. The main goal of this distribution method is to find an optimal way to balance the proportions of the sections. A compute node's size determines its performance: more CPU cores and a larger cache mean faster computation. If a former stage of FFT is too slow, the later stages cannot be executed and the next section becomes idle; if the later compute nodes take too many resources, they may sit waiting for the former computations.
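The 48-core example above can be written down as a small allocation check (our sketch; the (nodes, cores-per-node) layout and the 1-core/2-GB binding follow the text):

```python
def build_sections(total_cores, total_cache_gb, layout):
    """layout: list of (num_nodes, cores_per_node) tuples, one per section.
    Each core is bound with 2 GB of cache, so a node's cache is 2x its cores."""
    sections = [{"nodes": nodes,
                 "cores_per_node": cores,
                 "cache_gb_per_node": 2 * cores}
                for nodes, cores in layout]
    used_cores = sum(s["nodes"] * s["cores_per_node"] for s in sections)
    used_cache = sum(s["nodes"] * s["cache_gb_per_node"] for s in sections)
    # An allocation must use exactly the fixed resources of the Cloud.
    assert used_cores == total_cores and used_cache == total_cache_gb
    return sections

# The example from the text: 48 cores / 96 GB split into three sections.
sections = build_sections(48, 96, [(16, 1), (4, 4), (1, 16)])
```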

6 Experiment Results

In this section, we present experimental results to illustrate the theoretical improvements above. We compare ASFFT (our parallel FFT algorithm implemented in Apache Spark) with the MFFT algorithm (FFT in Matlab). The data used in the experiments are signal data from satellites. Since the satellites constantly send data to the data center, the data arrive as a stream, and every 64 MB of arrived data forms a data block. Hence, a requirement on the system is to finish the data processing job before the next data block arrives; otherwise data accumulate and the computation is delayed. The compute Cloud we used has 48 CPU cores and 96 GB of cache. Apache Spark is installed in the virtual machines to send data and execute the computation, and the different distribution strategies are implemented on the Cloud. In Figs. 5 and 6, we show the comparison between our parallel FFT in Apache Spark and the FFT in Matlab, conducted with 10 MB and 2 MB data units. The results show that when the data scale is small (the left-side columns in Figs. 5 and 6), MFFT takes less time than the ASFFT algorithm. The reason is that Apache Spark is designed for big data computation: when the data scale is small, the initialization of the Spark engine takes a large portion of the total time. As the data scale rises, the initialization of Spark takes a smaller portion and the parallel FFT performs better.

Fig. 5. 10 MB data unit comparison (MFFT vs. ASFFT; x-axis: data volume, 5*10m to 500*10m; y-axis: time in ms)

Fig. 6. 2 MB data unit comparison (MFFT vs. ASFFT; x-axis: data volume, 5*2m to 5000*2m; y-axis: time in ms)

However, as the data scale increases, the time spent in computation grows drastically. Although ASFFT spends more time than MFFT when the data scale is small, ASFFT shows its advantage when the data scale is large. Comparing Figs. 5 and 6, we observe that ASFFT's advantage is more obvious when the data unit is smaller (2 MB), because a smaller data unit makes the FFT easier to conduct. From Figs. 7 and 8, we can see that the parallelization of FFT effectively reduces the algorithm time. With more CPU cores, the speed of the algorithm increases. With 1 CPU core in the Cloud, the FFT is not parallelized and the speed is low; with 2 cores, the time spent by the algorithm drops to nearly a half. As more and more cores are added, the improvement becomes less and less pronounced.

Fig. 7. Parallel effectiveness comparison of 10 MB data (1 to 5 cores; x-axis: partition number, 20/10/5 partitions; y-axis: time in ms)

Fig. 8. Parallel effectiveness comparison of 2 MB data (1 to 5 cores; x-axis: partition number, 100/50/20/10/5 partitions; y-axis: time in ms)

In addition, the partition number also affects the algorithm speed. Too many partitions cause low efficiency, because more data partitions mean more RDDs formed in Spark, and dividing the original data into more partitions takes redundant time. Therefore, choosing an appropriately small number of data partitions significantly improves efficiency. Figure 9 shows the comparison between the two split strategies. The experiment was conducted in a Cloud with 16 CPU cores and 32 GB of cache. In the equally-split strategy, there are 8 workers, each with 2 CPU cores and 4 GB of cache. In the optimized-split strategy, there are 4 small workers with 2 CPU cores and 4 GB of cache each, and 1 large worker with 8 CPU cores and 16 GB of cache. When the data size is small, the equally-split strategy performs better than the optimized-split

Fig. 9. Comparison between the split strategies (optimized-split vs. equally-split; x-axis: data size, 64m to 640000m; y-axis: time in ms)

strategy. Nonetheless, when the data size becomes larger, the optimized-split strategy shows its advantage.

7 Conclusion and Future Work

We have presented a parallel Fast Fourier Transform algorithm in the Cloud. Using the big data framework Apache Spark, this algorithm stores intermediate data in cache, which decreases the time of the FFT. A three-step parallel FFT method is proposed, which enables FFT to be computed concurrently on different compute nodes. The existing parallel FFT algorithm suffers from too many data exchanges between compute nodes, which lowers its efficiency. We propose a new strategy to reallocate the computation resources: by splitting the CPU cores and cache among compute nodes in an optimized way, data exchange decreases. We have validated our algorithm through comparisons and an implementation in a Cloud. To further improve the performance of the parallel FFT algorithm, much work remains. The allocation strategies proposed in this paper can be further developed by considering more attributes of the computation resources. We also noticed that some studies investigate the performance of FFT algorithms on GPU clusters, which could be another direction of our future work.

Acknowledgements. The authors would like to thank the anonymous referees for their helpful comments, from which the preparation of this version of the paper has benefited. Thanks to Johann Sebastian Bach for his inspiring music accompanying the authors throughout this research. This work was supported in part by the National Natural Science Foundation of China under Grant 61572511, Grant 91648204 and Grant 61872378, in part by the Scientific Research Project of National University of


Defense Technology under Grant ZK16-03-57, in part by the China Postdoctoral Science Foundation under Grant 2016M602960 and Grant 2017T100796, in part by Science Fund for Distinguished Young Scholars in Hunan Province under Grant 2018JJ1032. Xiaomin Zhu is the corresponding author.

References

1. Armbrust, M., et al.: A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)
2. Armbrust, M., et al.: Above the clouds: a Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley (2009)
3. Baker, S.: Google and the wisdom of clouds. BusinessWeek 14 (2007)
4. Bergland, G.D.: A parallel implementation of the fast Fourier transform algorithm. IEEE Trans. Comput. 100(4), 366–370 (1972)
5. Bosagh Zadeh, R., et al.: Matrix computations and optimization in Apache Spark. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 31–38. ACM (2016)
6. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
7. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
8. Frigo, M., Johnson, S.G.: FFTW: an adaptive software architecture for the FFT. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384. IEEE (1998)
9. Githens, J.: A fully parallel computer for radar data processing. IEEE Trans. Aerosp. Electron. Syst. (5), 736 (1970)
10. Hassanieh, H., Adib, F., Katabi, D., Indyk, P.: Faster GPS via the sparse Fourier transform. In: International Conference on Mobile Computing and Networking, pp. 353–364 (2012)
11. Hassen, H., Khemakhem, M.: Arabic Islamic manuscripts digitization based on hybrid K-NN/SVM approach and cloud computing technologies. In: Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, pp. 366–371. IEEE (2013)
12. Kulkarni, P., Kumar, V., Verma, H.: Diagnostic acceptability of FFT-based ECG data compression. J. Med. Eng. Technol. 21(5), 185–189 (1997)
13. Li, F., Xu, J., Zhouhong, J., Miao, W.: Aerial target detection via GPS satellite broadcast signal. J. Chin. Inert. Technol. 22(6), 788–793 (2014)
14. Marcum, J.: A statistical theory of target detection by pulsed radar. IRE Trans. Inf. Theory 6(2), 59–267 (1960)
15. Marple, L.: Computing the discrete-time "analytic" signal via FFT. IEEE Trans. Signal Process. 47(9), 2600–2603 (1999)
16. Mullin, L.R., Small, S.G.: Four easy ways to a faster FFT. J. Math. Model. Algorithms 1(3), 193–214 (2002)
17. Palanisamy, B.: Purlieus: locality-aware resource allocation for MapReduce in a cloud. In: High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2011)
18. Prasad, N., Shameem, V., Desai, U., Merchant, S.: Improvement in target detection performance of pulse coded Doppler radar based on multicarrier modulation with fast Fourier transform (FFT). IEE Proc. Radar Sonar Navig. 151(1), 11–17 (2004)


19. Preuss, R.: Very fast computation of the radix-2 discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 30(4), 595–607 (1982)
20. Reddy, B.S., Chatterji, B.N.: An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Process. 5(8), 1266–1271 (1996)
21. Spark, A.: Lightning-fast cluster computing (2016)
22. Tang, G., Peng, L., Baldwin, P.R., Mann, D.S., Jiang, W., Rees, I., Ludtke, S.J.: EMAN2: an extensible image processing suite for electron microscopy. J. Struct. Biol. 157(1), 38–46 (2007)
23. Ubeyli, E., Güler, I.: Spectral analysis of internal carotid arterial Doppler signals using FFT, AR, MA, and ARMA methods. Comput. Biol. Med. 34(4), 293 (2004)
24. Vincke, R., Landschoot, S.V., Cordemans, P., Peuteman, J., Steegmans, E., Boydens, J.: Algorithm parallelization using software design patterns: an embedded case study approach. In: Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 470–473 (2013)
25. Wang, Y., Yang, R., Wo, T., Jiang, W., Hu, C.: Improving utilization through dynamic VM resource allocation in hybrid cloud environment. In: IEEE International Conference on Parallel and Distributed Systems, pp. 241–248 (2015)
26. Welch, P.: The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Trans. Audio Electroacoust. 15(2), 70–73 (1967)
27. Wold, E., Despain, A.: Pipeline and parallel-pipeline FFT processors for VLSI implementations. IEEE Trans. Comput. C-33(5), 414–426 (1984)
28. Xu, L., Li, J., Stoica, P.: Target detection and parameter estimation for MIMO radar systems. IEEE Trans. Aerosp. Electron. Syst. 44(3), 927–939 (2008)
29. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud, vol. 10, p. 95 (2010)
30. Zaharia, M., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
31. Zhu, X., Mong Sim, K., Jiang, J., Wang, J., Chen, C.: Agent-based dynamic scheduling for earth-observing tasks on multiple airships in emergency. IEEE Syst. J. 10(2), 661–672 (2016)
32. Zhu, X., Wang, J., Guo, H., Zhu, D., Yang, L.T., Liu, L.: Fault-tolerant scheduling for real-time scientific workflows with elastic resource provisioning in virtualized clouds. IEEE Trans. Parallel Distrib. Syst. 27(12), 3501–3517 (2016)

Task Offloading in Edge-Clouds with Budget Constraint

Lei He1, Hongli Xu1(B), Haibo Wang1, Liusheng Huang1, and Jingyi Ma2

1 Department of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei, China
{hl1994,wanghaib}@mail.ustc.edu.cn, {xuhongli,lshuang}@ustc.edu.cn
2 TianPing College of SuZhou University of Science and Technology, SuZhou 215011, Jiangsu, China
[email protected]

Abstract. Edge computing is an emerging computing model that extends the cloud and its services to the edge of the network. In edge-cloud computing, a set of servers is deployed near the mobile devices so that these devices can offload tasks to the servers with low latency. Most existing works focus on offloading tasks under the premise that the edge servers own sufficient resources, while ignoring the user's budget constraint. Without taking the budget into account, existing offloading schemes may cause the user to overspend, which is unacceptable. Thus, in this paper, we investigate the task offloading problem in edge-cloud computing, aiming to minimize the task duration when the tasks are generated by a user with a constrained budget and the edge servers are equipped with limited computation and storage resources. The problem we formulate is NP-hard. In order to solve it, we propose a heuristic strategy. The simulation results show that the proposed scheme improves the success ratio and reduces the task duration compared to random and greedy offloading schemes.

Keywords: Edge computing · Task offloading · Budget constraint

1 Introduction

Mobile devices are commonly used in people's everyday life. It is predicted that by 2020 the total number of devices will reach 75 billion, while the volume of mobile traffic will exceed 24.3 exabytes/month [1]. Furthermore, mobile devices are becoming more and more intelligent, while the applications on mobile devices become increasingly resource-hungry. Such applications include wearable virtual reality (VR) streaming [2], augmented reality (AR) [3], vehicular systems [4], etc. However, the gap between the required resources and those available on mobile devices keeps widening. To bridge this gap, mobile applications can offload their computation-intensive tasks to remote clouds [5]. However, an evident weakness of public-cloud-based mobile cloud computing is that mobile users may experience long latency when exchanging data with the public cloud through the wide area

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 311–326, 2018.
https://doi.org/10.1007/978-3-030-05057-3_25


network. Long latency hurts interactive response, since humans are acutely sensitive to delay and jitter. Moreover, it is very difficult to reduce latency in the wide area network. To deal with the long latency of remote clouds, edge computing [6,7] has been proposed, which extends cloud computing by placing a number of small-scale servers at the edge of the network. In this way, users can offload their tasks to edge servers and receive computing results with low network latency. However, compared to cloud computing, the scale of edge servers is rather small. The deployment of tasks is restricted by both resource and computation capacity: any particular edge server might not be able to support large-scale computing tasks. The tension between resource-hungry tasks and resource-constrained edge servers hence poses a significant challenge for future mobile platform development. In recent years, researchers have been paying more attention to the performance of edge-cloud networks, especially to task offloading problems. In summary, those works focus on two main aspects of task offloading. (i) Minimizing the task duration (e.g., [8,9]). When mobile devices create tasks, those works first decide whether the tasks should be offloaded to the edge-cloud or not, and then choose between remote and nearby edge servers based on the amount of computation resources required. (ii) Saving network energy cost: energy-efficient resource allocation schemes for mobile edge computing are studied in [10,11]. However, on closer inspection, the effectiveness of an offloading strategy faces the following challenges: (i) The limited resources of edge servers, including both computation and storage resources. In the big data era, many tasks are created to train a general model from big data; such tasks need appropriate edge servers to generate models and store data sets.
So, how to allocate these computing resources is a challenging problem. (ii) The budget constraint of the user. As an extension of cloud computing, edge servers charge for their services, and the price of a service depends on the resources requested by the tasks. Existing works that ignore this may lead the user to overspend. For example, when a user plays a VR game on a mobile device, the user pays for the game; when the game application decides to offload tasks to edge-clouds, the cost of offloading should not exceed that payment, while the user still desires low task delay and a high success ratio of task offloading. Thus, how to conduct effective task offloading under a budget constraint is challenging. In this context, to achieve low-latency task offloading, we should consider how to match the resources desired by users to the limited resources of edge servers. In this paper, we study the offloading problem in edge-cloud networks. To be specific, the edge servers are equipped with limited computation and storage resources, while the user who offloads tasks has a constrained budget. The task offloading problem we formulate is NP-hard. To solve it, we propose a budget-constrained task offloading scheme (BCTO) that aims to minimize the task duration. In detail, our task offloading scheme includes two


parts: (i) Computing the cost of computation. When a task is offloaded to an edge server, the server determines the cost of computing this task based on the computation and storage resources the task requires. (ii) Given the cost of a computation task on every edge server and the user budget, our scheme calculates the cost effectiveness and decides which server should be allocated to the task. In this paper, we take the task duration to be the execution time of the tasks on the edge servers, and the budget is set by the user who offloads the tasks. When the user creates many tasks, we assume these tasks are independent: for any two tasks, the result of one has no impact on the other, so tasks can be executed concurrently. When the user offloads tasks to edge servers, the user acts as a buyer with a constrained budget, while the computation and storage resources are regarded as the commodities. The BCTO scheme chooses an appropriate edge server for the offloaded tasks so that their computing and storage cost does not exceed the budget. The main contributions of this paper are summarized as follows:

1. We propose a price model for the edge servers, which measures the price of a computation task based on the computation and storage resources it requires. Using this model, the cost of each task on each edge server can be obtained, thus enabling optimal task offloading.
2. We present an efficient budget-constrained task offloading scheme. Based on the budget, the cost of every task, and the execution time on the edge servers, the scheme chooses an appropriate edge server for every task. Our scheme not only improves the success ratio of offloading but also reduces the task duration.
3. We conduct extensive experiments to evaluate the performance of the BCTO scheme.
The experimental results validate that our proposed algorithm improves the success ratio by 5%-10% and reduces the task duration by at least 30% compared to random and greedy offloading schemes.

The rest of this paper is organized as follows. In Sect. 2, we present the related works. In Sect. 3, the system model is described. In Sect. 4, we give the problem formulation. We propose the efficient task offloading algorithm in Sect. 5. Our simulation results and discussions are given in Sect. 6. Finally, Sect. 7 concludes this paper.

2 Related Works

2.1 Mobile Edge Computing

At present, mobile devices are becoming more and more powerful and intelligent. However, the development of mobile devices does not keep up with the resource demands of applications, so it is difficult to handle all application tasks directly on mobile devices. Mobile cloud computing has been proposed as a solution, and offloading heavy computation tasks to remote cloud data centers has been studied


for over a decade. CloneCloud [12] was proposed to use cloned virtual machine images in the cloud for mobile job offloading. Follow me cloud [13] was proposed for offloading computation-intensive tasks to the cloud for processing. COMET [14] migrates application threads between the mobile device and the cloud using a distributed shared memory model. However, since cloud servers are located far away from mobile devices, offloading tasks to the cloud may incur long delays. To overcome this challenge, mobile edge computing was proposed to provide nearby rich computing resources to mobile users [15], and there have been quite a lot of studies on the resource allocation problem. The works [16,17] offload tasks to the nearest edge servers since this is easy to apply, but it may lead to severe competition for the limited resources of an edge server. To solve this problem, a hierarchical architecture has been proposed [9]; it divides the edge-clouds into different levels according to the distance to the edge and presents a heuristic algorithm to minimize the task duration. Most existing works consider a single edge server in task offloading. The work in [18] showed that the cooperation of edge clouds can not only reduce the processing delay of user tasks, but also reduce the energy consumption. In a word, mobile edge computing can improve quality of service and energy efficiency by optimizing task offloading and resource allocation policies. [19] pointed out that processing tasks across cooperating edge-clouds is much better than processing at edge-clouds in isolation. Some of the works above assumed that task releases follow some known stochastic distribution, so [8] proposed an online algorithm without any assumption on the task release distribution.

2.2 Task Offloading with Limited Budget

There are many works on task scheduling with budget constraints, for example in grid or cloud environments. LOSS and GAIN [20] are scheduling approaches that adjust a schedule generated by a time-optimized heuristic and a cost-optimized heuristic, respectively, to meet users' budget constraints; however, they must be supported by another scheduling algorithm. BaTS [21], a budget- and time-constrained scheduler, can schedule large bags of independent tasks onto multiple clouds. Zhu et al. [22] proposed dynamic scheduling for tasks with fixed time limits and resource budget constraints. Reference [23] focuses on using genetic algorithms to solve scheduling problems considering the budget and deadline of the entire network. More recently, HCOC [24] discusses workflow execution in a cloud context; it reduces monetary cost while achieving the desired execution time by deciding which resources should be leased from the public cloud and which should be used in the private cloud. Byun et al. [25] provided PBTS (Partitioned Balanced Time Scheduling), which estimates the minimum number of computing hosts required to execute a workflow within a user-specified finish time; however, all computing hosts are assumed to have the same monetary cost per time unit. For large graph processing in the cloud, Li et al. [26] designed a cost-conscious scheduling algorithm (CCSH), which is an extension of HEFT.


In this paper, we construct analytical models to quantify the execution performance of independent tasks in edge computing, and we incorporate the price model into the cost calculation.

3 System Model

3.1 Network Model

We consider an edge computing scenario with M heterogeneous edge servers M = {s1, s2, ..., sM}, each of which is equipped with limited computation and storage resources. For each edge server si, we denote its resource status by the 2-tuple (Ri^c, Ri^m), where Ri^c and Ri^m are the computation resource and the storage resource owned by edge server si. The computation resource is described in terms of CPU cycles, while the storage resource is quantified in GB. There is a set T = {t1, t2, ..., tN} of indivisible tasks; those tasks are offloaded by a user with the constrained budget B. We adopt a widely used task model (see [7,27,28]) to describe task tj = (aj, cj), where aj stands for the computation amount of the task, i.e., the total CPU cycles needed to compute it, and cj stands for the size of the computation task, i.e., the amount of data contents (e.g., the data input and associated processing code) to be delivered to the edge servers. In our model, a mobile device dispatches a task to an edge server immediately after its release. We do not allow servers to migrate a task to other servers after offloading, so as to avoid migration overhead, and we assume a server can execute at most one task at a time.

3.2 Price Model

Since each edge server is equipped with limited computation and storage resources, the prices of those resources determine the cost of executing offloaded tasks on the server. It is reasonable to value the resources according to the edge servers' performance when utilizing them. Let Pi^C(q) denote the price of computation for q units of CPU cycles per second on edge server si, and Pi^S(e) the price of e units of storage on edge server si. For the computation price, we adopt a nonlinear model:

    Pi^C(x) / Pi^C(y) >= x / y,    i ∈ M, x, y ∈ {1, 2, ..., Ri^c},    (1)

where Pi^C(x) and Pi^C(y) denote the prices of x and y units of CPU cycles on edge server si, and Ri^c denotes the computation resource limit of edge server si. For the storage price, we define a function linear in the storage size:

    Pi^S(x) / Pi^S(y) = x / y,    i ∈ M, x, y ∈ {1, 2, ..., Ri^m},    (2)

where Pi^S(x) and Pi^S(y) denote the prices of x and y units of storage on edge server si, and Ri^m denotes the storage resource limit of edge server si.
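The paper leaves the concrete price functions abstract. As a hypothetical illustration only, a superlinear computation price and a linear storage price are consistent with the nonlinear model (1) and the linear model (2); the base prices and exponent below are our own assumptions:

```python
# Hypothetical price functions for a single edge server; the paper only
# constrains their shape (nonlinear for CPU, linear for storage).

def computation_price(q, base=2.0, alpha=1.5):
    """Superlinear price of q CPU-cycle units (alpha > 1)."""
    return base * q ** alpha

def storage_price(e, unit=3.0):
    """Linear price of e storage units."""
    return unit * e

x, y = 8, 2
# Relation (1) holds for the superlinear form (for x >= y):
assert computation_price(x) / computation_price(y) >= x / y
# Relation (2) holds exactly for the linear form:
assert abs(storage_price(x) / storage_price(y) - x / y) < 1e-9
```

Any pair of functions with these shapes would work equally well; the heuristic in Sect. 5 only consumes the resulting prices.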

3.3 Task Offloading Model

In this section, we introduce the computation task offloading model in detail. As described above, task tj is characterized by tj = (aj, cj). Considering the differences in computation resources among edge servers, we denote the computation resource (CPU cycles per second) of edge server si by Ri^c. According to the network model, the task duration on edge server si is the time for which the task executes on that server. Therefore, the duration of task tj on edge server si is

    tij = aj / Ri^c,    i ∈ M, j ∈ N.    (3)

Similar to the study [29], we ignore the transmission delay for edge servers to receive data from and return results to the user, because the edge servers are deployed very close to the mobile devices; the processing time of tasks on edge servers is the dominant part compared to their transmission time.
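Under this model, the per-server duration of a task reduces to a single division; a one-line sketch with hypothetical numbers:

```python
def task_duration(a_j, r_c):
    """Eq. (3): duration of a task needing a_j CPU cycles on a server
    providing r_c cycles per second (transmission delay ignored)."""
    return a_j / r_c

# 8 Gcycles on a server offering 4 Gcycles/s:
print(task_duration(8e9, 4e9))  # → 2.0 (seconds)
```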

4 Problem Formulation

In this paper, given the limited computation and storage resources of edge servers, we consider the following problem: how to select appropriate edge servers for tasks so as to minimize the task duration under the user's budget constraint. Define the matching matrix as X = {xij}M×N, where xij indicates whether edge server si serves task tj: if task tj is offloaded to edge server si, then xij = 1, otherwise xij = 0. The matching matrix must satisfy the constraint

    Σ_{i=1}^{M} xij <= 1,    j ∈ N,    (4)

which ensures that one task can be served by at most one edge server. If task tj is served by edge server si, then the cost of computing task tj on edge server si is

    pij = Pi^C(aj) + Pi^S(cj),    i ∈ M, j ∈ N.    (5)

For each edge server si, the total cost of the tasks executed on the server is

    pi = Σ_{j=1}^{N} xij pij,    i ∈ M.    (6)

When task tj is offloaded to edge server si, its execution time on si is tij, as described in the task offloading model. Thus, the overall execution time on edge server si can be expressed as

    Ti = Σ_{j=1}^{N} xij tij,    i ∈ M.    (7)

According to the analysis above, the problem can be formulated as follows:

    min max_{i∈M} Ti    (8)

    subject to:

    Σ_{i=1}^{M} pi <= B,    (9)

    Σ_{i=1}^{M} xij cj <= Ri^m,    i ∈ M, j ∈ N,    (10)

    Σ_{i=1}^{M} xij <= 1,    j ∈ N,    (11)

    xij ∈ {0, 1},    i ∈ M, j ∈ N.    (12)

Table 1. Notation

    Parameter   Definition
    M           Set of edge servers
    T           Set of computation tasks
    si          Edge server
    tj          Computation task
    Ri^c        Total number of CPU cycles owned by edge server si
    Ri^m        Total storage size owned by edge server si
    aj          CPU cycles needed by task tj
    cj          Storage size needed by task tj
    Pi^C(q)     Price of q units of CPU cycles per time unit on edge server si
    Pi^S(e)     Price of e units of storage size on edge server si
    tij         Duration of task tj executed on edge server si
    xij         Indicator of whether edge server si serves task tj
    Ti          Overall execution time on edge server si
    Texe        Minimum execution time of all the tasks
    prij        Price ratio
    psij        Price effectiveness
    pall        Cost of all offloaded tasks
    ps          Set of price-effectiveness values of all tasks on all edge servers

The objective function (8) minimizes the maximum execution time of tasks over the edge servers. The first constraint (9) states that, for all tasks executed on edge servers, the cost of computation and storage must not exceed the budget B. The second constraint (10) states that, for any edge server si, the storage requested by the tasks executed on it is no more than the server's storage resource. The third constraint (11) means that one task can be served by at most one edge server. The last condition (12) indicates whether task tj is served by edge server si or not. The problem we formulate is NP-hard [30]; we therefore focus on designing a heuristic approach to this optimization problem. The notation is summarized in Table 1.
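For very small instances, the formulation can be sanity-checked by exhaustive search. The sketch below uses hypothetical prices and server/task values, and additionally assumes every task must be assigned; it enumerates all assignments and returns the minimum makespan subject to the budget and storage constraints:

```python
from itertools import product

# Hypothetical per-unit prices (the paper leaves P^C and P^S abstract).
def price(a, c):
    return a + 2 * c   # stand-in for P^C(a) + P^S(c), both taken linear here

def brute_force(servers, tasks, B):
    """Exhaustively minimize max_i T_i subject to (9)-(12).
    servers: list of (Rc, Rm); tasks: list of (a_j, c_j).
    Assumes every task is assigned to exactly one server."""
    M, N = len(servers), len(tasks)
    best = None
    for assign in product(range(M), repeat=N):   # one column of X per task
        loads, cost, feasible = [0.0] * M, 0.0, True
        for j, (a, c) in enumerate(tasks):
            rc, rm = servers[assign[j]]
            if c > rm:                  # storage constraint (10)
                feasible = False
                break
            loads[assign[j]] += a / rc  # eq. (3), accumulated as in (7)
            cost += price(a, c)         # eq. (5)
        if feasible and cost <= B:      # budget constraint (9)
            makespan = max(loads)       # objective (8)
            if best is None or makespan < best:
                best = makespan
    return best

# Two servers (cycles/s, storage) and three tasks (cycles, size):
print(brute_force([(4, 10), (2, 10)], [(8, 1), (4, 1), (4, 1)], B=30))  # → 3.0
```

Such a check is exponential in the number of tasks, which is exactly why the next section resorts to a heuristic.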

5 Task Offloading Scheme in Edge Computing

Our work targets computation-intensive tasks in edge-clouds, where the data transfer time is assumed negligible since (i) the time for data transfers in most computation-intensive tasks constitutes less than 10% of the overall task execution time [21], and (ii) the edge servers are deployed very close to the mobile devices.

Algorithm 1. BCTO(M, T, B)
Input: A set of edge servers M equipped with computation and storage resources; a set of tasks T with required computation and storage resources; a fixed budget B.
Output: The minimum execution time Texe of the tasks.
1: for all si in M do
2:   Set the overall time on edge server si: Ti = 0.
3:   for all tj in T do
4:     if cj <= Ri^m then
5:       Calculate the execution time tij and the price pij according to equations (3) and (5).
6:       if pij > B then
7:         Exit the program with an error.
8:       end if
9:       Calculate the price ratio as prij = pij / B.
10:      Calculate the price effectiveness as psij = prij × pij / tij.
11:    else
12:      Set tij = 0, pij = 0 and prij = 0.
13:    end if
14:  end for
15: end for
16: Texe = Offload(ps, T, B).
17: return Texe.

We propose the BCTO algorithm as shown in Algorithm 1. Algorithm 1 first estimates the cost and execution time of every task, based on the resources required by every task and the computation and storage resources owned by the edge servers. If the cost of any task is greater than the budget, the offloading terminates. Algorithm 1 then computes the price ratio of each task on every edge server and, from it, the price effectiveness on each server. The price effectiveness is the basis of the offloading strategy in Algorithm 2. For each task, Algorithm 2 sorts the price-effectiveness values over the edge servers and drops the least favorable servers (in this paper, two servers are dropped); the indices from start to end mark the servers kept for offloading. The task is then offloaded to the kept edge server whose overall time increases the least. Clearly, the cost of all computation tasks never exceeds the budget in Algorithm 2. The time complexity of Algorithm 1 is O(M × N), where M is the number of edge servers and N is the number of computation tasks.

Algorithm 2. Offload(ps, T, B)
Input: The set ps of price-effectiveness values of all tasks on all edge servers; the set of tasks T with prices pij and times tij; a fixed budget B.
Output: The minimum execution time Texe of those tasks.
1: Set the cost of all offloaded tasks pall = 0.
2: Set M as the number of edge servers, start = 1, and end = M − 2.
3: Set the overall time on every edge server Ti = 0.
4: for all tj in T with pij ≠ 0 and tij ≠ 0 do
5:   if pall <= B then
6:     Sort the psij over all i in ascending order.
7:     for i from start to end do
8:       Offload task tj to the edge server si = arg min_{si} (Ti + tij).
9:     end for
10:    Ti = Ti + tij, pall = pall + pij.
11:  end if
12: end for
13: Set Texe = max_{i∈M} Ti.
14: return Texe.
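The two algorithms can be sketched compactly as follows. All numbers are hypothetical, and the price-effectiveness formula and the drop-out step follow our reading of the printed pseudocode: per task, the worst-ranked servers are discarded and the task goes to the kept server whose load grows least.

```python
def bcto(servers, tasks, budget):
    """Sketch of Algorithms 1-2. servers: list of (Rc, Rm, price_fn);
    tasks: list of (a_j, c_j). Returns the makespan T_exe."""
    loads = [0.0] * len(servers)     # T_i per server
    p_all = 0.0                      # total spend
    for a, c in tasks:
        # Algorithm 1: per-server time, price, and price effectiveness.
        candidates = []
        for i, (rc, rm, price) in enumerate(servers):
            if c <= rm:                        # storage feasibility
                t_ij = a / rc                  # eq. (3)
                p_ij = price(a, c)             # eq. (5)
                if p_ij > budget:
                    raise ValueError("a single task exceeds the budget")
                pr_ij = p_ij / budget          # price ratio
                ps_ij = pr_ij * p_ij / t_ij    # price effectiveness (our reading)
                candidates.append((ps_ij, t_ij, p_ij, i))
        if not candidates:
            continue                           # task cannot be offloaded
        # Algorithm 2: keep the best-ranked servers, drop the two worst,
        # then pick the feasible server minimizing the load increase.
        candidates.sort()                      # ascending price effectiveness
        kept = candidates[:max(1, len(candidates) - 2)]
        _, t_ij, p_ij, i = min(kept, key=lambda s: loads[s[3]] + s[1])
        if p_all + p_ij <= budget:             # never exceed the budget
            loads[i] += t_ij
            p_all += p_ij
    return max(loads)                          # T_exe

# Hypothetical instance: two servers with a linear price, generous budget.
linear = lambda a, c: a + 2 * c
print(bcto([(4, 10, linear), (2, 10, linear)], [(8, 1), (4, 1), (4, 1)], 1000))  # → 8.0
```

With only two servers, dropping two would leave none, so the sketch always keeps at least one candidate.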

6 Simulation and Performance Evaluation

In this section, simulation experiments on task offloading for edge computing are presented. The experiments are divided into three parts: (i) We implement the task offloading algorithm and evaluate its offloading performance in


comparison with two other task offloading schemes, in terms of the user budget, the number of edge servers, and the number of tasks. (ii) We study the impact of the computation amount on the performance of task offloading in comparison with the random and greedy offloading schemes. (iii) We investigate the impact of the task data size on the performance of task offloading in comparison with the random and greedy offloading schemes.

6.1 Simulation Settings

For task tj, we assume the required resources, namely the CPU cycles aj and the data size cj, are generated from probability distributions. Similar to the work [11], the computation resources owned by edge servers range from 20 to 50 GHz, while the storage resources owned by edge servers range from 1 GB to 16 GB.

6.2 Comparison to Other Methods

We compare our algorithm with two other task offloading strategies: the random offloading scheme and the greedy offloading scheme.

1. Random offloading scheme: the computation tasks are offloaded to edge servers randomly. We first set up a random generator that produces an M-tuple whose values range from 0 to 1 with equal probability and sum to 1, where M is the number of edge servers. We then take the index of the maximum value in the tuple and offload the computation task to the corresponding edge server.
2. Greedy offloading scheme: the tasks are offloaded to the most powerful edge server to obtain the minimum task duration. Most works on edge servers (e.g., [16,31]) adopt the greedy strategy as the task offloading policy.

For the three methods above, the offloading performance we evaluate refers to the task duration and the success ratio of task offloading. The task duration in our work refers to the task execution time on the edge server, while the success ratio is the number of successfully offloaded tasks divided by the total number of tasks. The CPU cycles of each task are generated from a normal distribution with a mean of 2 GHz, and the data size of each task is generated from a normal distribution with a mean of 2 GB. When we evaluate the impact of the user budget, we set the number of tasks to 10000 and the number of edge servers to 20. Figure 1 shows the impact of the user budget. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 6% and reduces the task duration by about 30%. Compared with the greedy offloading scheme, our proposed scheme improves the success ratio by about 30% while reducing the task duration by 45%.
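The random baseline, as described above, can be sketched as follows (drawing an M-tuple of normalized weights and offloading to the index of the maximum):

```python
import random

def random_offload_index(m):
    """Random baseline: draw an m-tuple of random values, normalize so
    they sum to 1, and return the index holding the maximum value."""
    weights = [random.random() for _ in range(m)]
    total = sum(weights)
    weights = [w / total for w in weights]   # the tuple sums to 1
    return max(range(m), key=lambda i: weights[i])

# Over many draws, every server index is selected:
random.seed(42)
hits = {random_offload_index(3) for _ in range(200)}
print(sorted(hits))  # → [0, 1, 2]
```

Since the argmax of i.i.d. uniform weights is itself uniform over the indices, this is equivalent in distribution to picking a server uniformly at random.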

Fig. 1. Impact of user budget (success ratio (%) and task duration (seconds) vs. budget, for the BCTO, Random, and Greedy schemes)

Figure 2 shows the impact of the number of edge servers; here we set the number of tasks to 10000 and the user budget to 60000. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 8% and reduces the task duration by about 30%. Compared with the greedy offloading scheme, our proposed scheme improves the success ratio by about 35% while reducing the task duration by 45%.

Fig. 2. Impact of edge server number (success ratio (%) and task duration (seconds) vs. number of edge servers, for the BCTO, Random, and Greedy schemes)

Figure 3 shows the impact of the number of tasks; here we set the user budget to 100000 and the number of edge servers to 20. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 5% and reduces the task duration by about 35%. Compared with the greedy offloading scheme, our proposed scheme improves the success ratio by about 20% while reducing the task duration by 40%.

6.3 Impact of Computation Amount on Task Offloading

In this section, we consider the impact of the computation amount on task offloading performance. The data size of each task follows a normal distribution with mean value

Fig. 3. Impact of task number (success ratio (%) and task duration (seconds) vs. number of tasks, for the BCTO, Random, and Greedy schemes)

of 2 GB. The user budget is set to 70000, the number of tasks is 10000, and the number of edge servers is 20. For the computation amount, three kinds of distributions are used: uniform, normal, and Pareto. The first panels of Figs. 4 and 5 show the results when the computation amount follows the uniform distribution. Compared with the random offloading scheme, our proposed BCTO scheme improves the success ratio by 5% while reducing the task duration by about 30%. Compared with the greedy offloading scheme, our scheme improves the success ratio by 25%-30% and reduces the task duration by 40%-50%.

Fig. 4. Impact of computation amount: success ratio under uniform, normal, and Pareto distributions

The second panels of Figs. 4 and 5 show the results when the computation amount follows the normal distribution: our scheme improves the success ratio by 5% and reduces the task duration by 35% compared with the random offloading scheme, and improves the success ratio by 25% while reducing the task duration by 45% compared with the greedy offloading scheme. The third panels of Figs. 4 and 5 show the results when the computation amount follows the Pareto distribution. Compared with the random offloading scheme, our scheme improves the success ratio by 5% while reducing the task duration by more than

Fig. 5. Impact of computation amount: time duration under uniform, normal, and Pareto distributions

30% of the task duration. Compared with the greedy offloading scheme, our scheme improves the success ratio by 20%-25% while reducing the task duration by 40%.

6.4 Impact of Data Size on Task Offloading

In this section, we consider the impact of the data size on task offloading performance. The computation amount follows a normal distribution with a mean of 2 GHz. The user budget is 40000, the number of tasks is 10000, and the number of edge servers is 20. For the data size, three kinds of distributions are used: normal, uniform, and Pareto. As shown in Figs. 6 and 7, our proposed offloading scheme exhibits a higher success ratio of task offloading and a shorter task duration than both the random and greedy offloading schemes. The first panels of Figs. 6 and 7 show the results when the data size follows the uniform distribution. Compared with the random offloading scheme, our proposed scheme improves the success ratio by 5% and reduces the task duration by 30% on average. Compared with the greedy offloading scheme, our scheme improves the success ratio by 30% and reduces the task duration by 40% on average.

Fig. 6. Impact of data size: success ratio under uniform, normal, and Pareto distributions

Fig. 7. Impact of data size: time duration under uniform, normal, and Pareto distributions

The second panels of Figs. 6 and 7 show the results when the data size follows the normal distribution: compared with the random offloading scheme, our scheme improves the success ratio by 5%-10% and reduces the task duration by 30% on average; compared with the greedy offloading scheme, it improves the success ratio by 25% and reduces the task duration by 45% on average. The third panels of Figs. 6 and 7 show the results when the data size follows the Pareto distribution: compared with the random offloading scheme, our scheme improves the success ratio by 5%-10% and reduces the task duration by 35% on average; compared with the greedy offloading scheme, it improves the success ratio by 25% and reduces the task duration by 45% on average.

7 Conclusion

In this paper, we first formulated a budget-constrained task offloading problem for delay minimization in edge computing environments, where the edge servers are equipped with limited computation and storage resources. We then proposed a heuristic algorithm to solve the formulated problem. Simulation results show that our proposed scheme achieves a higher success ratio of task offloading and a shorter task duration than the random and greedy computation offloading schemes. In future work, we intend to consider task offloading in more complicated deployments with user mobility.

Acknowledgement. This paper is supported by the NSFC under Grant No. 61472383, U1709217, and 61472385, and the Natural Science Foundation of Jiangsu Province in China under No. BK20161257.


References

1. Networking, V.: Cisco visual networking index: global mobile data traffic forecast update, 2014-2019. White paper
2. Chen, Z., et al.: An empirical study of latency in an emerging class of edge computing applications for wearable cognitive assistance. In: SEC, p. 14 (2017)
3. Hu, Y.C., Patel, M., Sabella, D., Sprecher, N., Young, V.: Mobile edge computing - a key technology towards 5G. ETSI White Pap. 11(11), 1-16 (2015)
4. Truong, N.B., Lee, G.M., Ghamri-Doudane, Y.: Software defined networking-based vehicular adhoc network with fog computing. In: IFIP/IEEE International Symposium on Integrated Network Management (IM), pp. 1202-1207 (2015)
5. Barbera, M.V., Kosta, S., Mei, A., Stefa, J.: To offload or not to offload? The bandwidth and energy costs of mobile cloud computing. In: Proceedings IEEE INFOCOM, pp. 1285-1293, April 2013
6. Taleb, T., Samdanis, K., Mada, B., Flinck, H., Dutta, S., Sabella, D.: On multi-access edge computing: a survey of the emerging 5G network edge cloud architecture and orchestration. IEEE Commun. Surv. Tutor. 19(3), 1657-1681 (2017)
7. Zhang, S., Zhang, N., Zhou, S., Gong, J., Niu, Z., Shen, X.: Energy-aware traffic offloading for green heterogeneous networks. IEEE J. Sel. Areas Commun. 34(5), 1116-1129 (2016)
8. Tan, H., Han, Z., Li, X.Y., Lau, F.C.M.: Online job dispatching and scheduling in edge-clouds. In: IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1-9, May 2017
9. Tong, L., Li, Y., Gao, W.: A hierarchical edge cloud architecture for mobile computing. In: IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pp. 1-9, April 2016
10. You, C., Huang, K., Chae, H., Kim, B.H.: Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Trans. Wirel. Commun. 16(3), 1397-1411 (2017)
11. Chen, M., Hao, Y.: Task offloading for mobile edge computing in software defined ultra-dense network. IEEE J. Sel. Areas Commun. 36(3), 587-597 (2018)
12. Chun, B.G., Ihm, S., Maniatis, P., Naik, M., Patti, A.: CloneCloud: elastic execution between mobile device and cloud. In: Proceedings of the Sixth Conference on Computer Systems, pp. 301-314. ACM (2011)
13. Claffy, K.C., Polyzos, G.C., Braun, H.W.: Application of sampling methodologies to network traffic characterization. In: ACM SIGCOMM Computer Communication Review, vol. 23, pp. 194-203. ACM (1993)
14. Gordon, M.S., Jamshidi, D.A., Mahlke, S.A., Mao, Z.M., Chen, X.: COMET: code offload by migrating execution transparently. OSDI 12, 93-106 (2012)
15. Taleb, T., Dutta, S., Ksentini, A., Iqbal, M., Flinck, H.: Mobile edge computing potential in making cities smarter. IEEE Commun. Mag. 55(3), 38-43 (2017)
16. Jia, M., Cao, J., Liang, W.: Optimal cloudlet placement and user to cloudlet allocation in wireless metropolitan area networks. IEEE Trans. Cloud Comput. 5(4), 725-737 (2017)
17. Urgaonkar, R., Wang, S., He, T., Zafer, M., Chan, K., Leung, K.K.: Dynamic service migration and workload scheduling in edge-clouds. Perform. Eval. 91, 205-228 (2015)
18. Xiao, Y., Krunz, M.: QoE and power efficiency tradeoff for fog computing networks with fog node cooperation. In: IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1-9, May 2017

326

L. He et al.

19. Tran, T.X., Pompili, D.: Joint task oﬄoading and resource allocation for multiserver mobile-edge computing networks (2017). arXiv preprint arXiv:1705.00704 20. Sakellariou, R., Zhao, H., Tsiakkouri, E., Dikaiakos, M.D.: Scheduling workﬂows with budget constraints. Integrated Research in GRID Computing, pp. 189–202. Springer, Boston (2007). https://doi.org/10.1007/978-0-387-47658-2 14 21. Oprescu, A.M., Kielmann, T.: Bag-of-tasks scheduling under budget constraints. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp. 351–359, November 2010 22. Zhu, Q., Agrawal, G.: Resource provisioning with budget constraints for adaptive applications in cloud environments. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, pp. 304– 307. ACM, New York (2010) 23. Gharooni-fard, G., Moein-darbari, F., Deldari, H., Morvaridi, A.: Scheduling of scientiﬁc workﬂows using a chaos-genetic algorithm. Procedia Comput. Sci. 1(1), 1445–1454 (2010) 24. Bittencourt, L.F., Madeira, E.R.M.: Hcoc: a cost optimization algorithm for workﬂow scheduling in hybrid clouds. J. Internet Serv. Appl. 2(3), 207–227 (2011) 25. Byun, E.K., Kee, Y.S., Kim, J.S., Maeng, S.: Cost optimized provisioning of elastic resources for application workﬂows. Futur. Gener. Comput. Syst. 27(8), 1011–1026 (2011) 26. Li, J., Su, S., Cheng, X., Huang, Q., Zhang, Z.: Cost-conscious scheduling for large graph processing in the cloud. In: IEEE International Conference on High Performance Computing and Communications, pp. 808–813, September 2011 27. Chen, X., Jiao, L., Li, W., Fu, X.: Eﬃcient multi-user computation oﬄoading for mobile-edge cloud computing. IEEE/ACM Trans. Netw. 24(5), 2795–2808 (2016) 28. Mao, Y., You, C., Zhang, J., Huang, K., Letaief, K.B.: A survey on mobile edge computing: the communication perspective. IEEE Commun. Surv. Tutor. 19(4), 2322–2358 (2017) 29. 
Sun, Y., Zhou, S., Xu, J.: EMM: Energy-aware mobility management for mobile edge computing in ultra dense networks. IEEE J. Sel. Areas Commun. 35(11), 2637–2646 (2017) 30. Wu, C.Q., Lin, X., Yu, D., Xu, W., Li, L.: End-to-end delay minimization for scientiﬁc workﬂows in clouds under budget constraint. IEEE Trans. Cloud Comput. 3(2), 169–181 (2015) 31. Tawalbeh, L.A., Jararweh, Y., Ababneh, F., Dosari, F.: Large scale cloudlets deployment for eﬃcient mobile cloud computing. JNW 10, 70–76 (2015)

Motion Trajectory Sequence-Based Map Matching Assisted Indoor Autonomous Mobile Robot Positioning

Wenping Yu¹, Jianzhong Zhang¹(✉), Jingdong Xu², and Yuwei Xu¹

¹ College of Cyberspace Security, Nankai University, Tianjin, China
[email protected], {zhangjz,xuyw}@nankai.edu.cn
² College of Computer Science, Nankai University, Tianjin, China
[email protected]

Abstract. Position information is one of the basic elements of context awareness for autonomous mobile robots. This paper studies a positioning algorithm for autonomous mobile robots suited to search and rescue in dark building corridors and underground mine tunnels during emergencies, and proposes a novel map matching aided positioning algorithm based on a Hidden Markov Model. The algorithm does not rely on a camera; it uses only the inertial sensors installed in the mobile robot and the indoor map to realize the fusion of dead reckoning and map matching. First, it detects the position-related motion postures during the motion process, and then divides the motion trajectory into a sub-trajectory sequence. By matching the sub-trajectory sequence against the indoor map, the proposed algorithm achieves tracking and positioning of the mobile robot. To verify the effectiveness of the proposed algorithm, this paper adopts a four-wheel differentially driven robot for experimental analysis in an actual indoor scenario. The experimental results show that, compared with traditional dead reckoning, the algorithm distinctly reduces the average positioning error of the mobile robot and is robust to heading angle noise within a certain error range.

Keywords: Mobile robot · Indoor positioning · Hidden Markov Model · Posture pattern detection

1 Introduction

With the advancement of artificial intelligence, network and sensor technologies, the research and application of autonomous mobile robots have made remarkable progress in recent years. Indoor autonomous mobile robots are increasingly integrated into people's daily lives [1]. Autonomous mobile robots can be used extensively not only in modern intelligent warehouses, home services and many other settings, but also in corridors of complex buildings, subway tunnels and underground mines when accidents occur. Therefore, research on indoor autonomous mobile robot technology has gradually become a hot topic, and many research institutes in China such as Tsinghua University, Harbin Institute of Technology, Nankai University and South China University of Technology are committed to the research and development of indoor autonomous mobile robots [2–6].

The autonomous positioning of a mobile robot is the process by which the robot determines its own position in its working environment, and it is one of the most basic problems in improving the robot's autonomous capabilities. For outdoor positioning, the Global Positioning System (GPS) has become a widely used positioning technology for mobile robots. For indoor positioning, however, because building walls block and attenuate GPS signals and the indoor electromagnetic environment is complex, there is no universal solution to the positioning problem of indoor mobile robots [7,8]. Researchers have proposed a variety of positioning methods for indoor autonomous mobile robots, including navigation beacon-based positioning [9], computer vision-based positioning [10,11], dead reckoning [12], map matching [13,14] and simultaneous localization and mapping (SLAM) [15,16]. Positioning techniques based on navigation beacons rely on a series of deployed feature signals to provide stable and accurate location information, but require high deployment and maintenance costs. Dead reckoning uses inertial sensors or encoders to provide relatively accurate positions over short distances, but suffers from cumulative error that grows with the distance traveled, and the robot's starting point must be known in advance.

(© Springer Nature Switzerland AG 2018. J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 327–341, 2018. https://doi.org/10.1007/978-3-030-05057-3_26)
Map matching positioning uses known indoor maps to construct topological maps, feature maps, and other abstract maps, and then obtains the position of the mobile robot by matching the robot's motion trajectory with the indoor maps; by its realization principle, the real-time performance of map matching is relatively poor. SLAM technology has unique advantages in unknown environments and can provide indoor floor plans or 3D maps while positioning [17]. However, it requires mobile robots to be equipped with more complex sensor devices, such as infrared, ultrasonic radar and RGB-D vision systems, and therefore has a higher implementation cost. Corridors of buildings, subway station tunnels and underground mines often have complex, maze-like passageways. In the event of an accident such as a fire, the power supply is damaged, the communication infrastructure becomes unusable, and smoke and dust deprive the interior of lighting. All these situations pose challenges for the positioning of indoor autonomous mobile robots. Due to limitations of the working environment or deployment conditions, it is difficult to establish visual or wireless navigation beacons in advance, so positioning based on navigation beacons is not suitable; high temperature and smoke make it difficult for cameras to provide image information, so visual positioning fails; and the timeliness of SLAM cannot meet the urgent time constraints of the above scenarios. In response to these problems, this paper introduces a hidden Markov model (HMM) based map matching algorithm that does not rely on a camera; it uses only the inertial sensors (accelerometer, gyroscope, and magnetometer) installed in the autonomous mobile robot and a known indoor map to effectively track and position the robot.

2 Robot Motion Model and Positioning Method

In the field of indoor autonomous mobile robot positioning, dead reckoning and map matching are strongly complementary. This paper proposes a map matching-assisted positioning method based on the motion trajectory sequence of the mobile robot to realize the fusion of these two technologies. The positioning algorithm uses stairs and corridor corners in the indoor environment as virtual landmarks. When the mobile robot passes through these landmarks, the inertial sensor data exhibit a specific pattern; in this paper, such landmarks are therefore called posture-related positions. When the robot's movement distance is short, dead reckoning gives the real-time position of the robot. When the movement distance is long, the robot's motion trajectory can be divided into multiple sub-trajectories according to the landmarks, and consecutive sub-trajectories form a sub-trajectory sequence. With the help of the HMM, this sub-trajectory sequence can be matched to the corresponding roads in the indoor map, and the position of the mobile robot is then estimated. Further, when the motion trajectory is long enough, the absolute position of the mobile robot can still be estimated even without knowing the robot's starting point.

2.1 Robot Motion Model and Its Dead Reckoning Algorithm

In this paper, a four-wheel differential-drive mobile robot is used to study the positioning problem of autonomous mobile robots in indoor environments. The driving motors are direct-current (DC) motors; the two driving motors on each side are connected in reverse parallel and controlled through an L298N motor driver module, and the mobile robot uses a Raspberry Pi B V1.2 as the main controller. Given this driving mode, the two wheels on each side of the wheeled robot share the same motion model, so the motion model of the mobile robot can be simplified to a left and right two-wheel differential driving mode. Figure 1(a) shows the simplified motion model, where (x, y) is the position of the mobile robot in the global coordinate system and Θ is the angle between the robot's heading direction and true north. The autonomous mobile robot used in this paper has a built-in digital compass, three-axis accelerometer and gyroscope. The digital compass gives the initial attitude of the robot, while the accelerometer and the gyroscope measure its movement acceleration and rotation angular velocity.


Fig. 1. Simplified motion model and its dead reckoning principle for a four-wheel differentially driven robot. (a) Motion model and robot coordinate system. (b) Dead reckoning in the global coordinate system.

The distance traveled and the change in heading direction of the mobile robot can be obtained by integration, from which we derive the latest position and posture of the robot. To determine the position and posture of the mobile robot in the plane, we establish the global coordinate system OXY. Assuming that the starting point (x₀, y₀) is the origin of the coordinates and the starting attitude is the positive direction of the X-axis, the position and posture of the mobile robot at time k can be expressed by the vector (v_k, θ_k, x_k, y_k)ᵀ, where v_k denotes the instantaneous velocity, θ_k the heading direction, and x_k, y_k the coordinates of the robot in the global coordinate system, as shown in Fig. 1(b). When the update cycle of the sensor data is very small (5 ms in this paper), the trajectory of the mobile robot within one cycle can be approximated by a straight line, and the position at time k can be obtained recursively by Eq. 1:

$$\begin{pmatrix} v_k \\ \theta_k \\ x_k \\ y_k \end{pmatrix} = \begin{pmatrix} v_{k-1} \\ \theta_{k-1} \\ x_{k-1} \\ y_{k-1} \end{pmatrix} + \begin{pmatrix} 0.5\,(a_{k-1}+a_k)\,\Delta t \\ 0.5\,(\omega_{k-1}+\omega_k)\,\Delta t \\ d_k \cos\theta_{k-1} \\ d_k \sin\theta_{k-1} \end{pmatrix}, \quad k \ge 1 \qquad (1)$$

where Δt is the time interval from time k−1 to time k (if the sensor has a fixed data update period, Δt is that period), a_k is the instantaneous acceleration along the heading direction of the mobile robot at time k, measured by the Y-axis component of the accelerometer, ω_k is the angular velocity of the heading direction at time k, measured by the Z-axis of the gyroscope, and d_k is the distance moved by the robot from time k−1 to k, given by:

$$d_k = \frac{v_{k-1}+v_k}{2}\,\Delta t \qquad (2)$$
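As a concrete illustration, the recursion of Eqs. 1 and 2 can be sketched in a few lines of Python (the function and variable names are ours; the paper does not publish code):

```python
import math

def dead_reckoning_step(state, a_prev, a_k, w_prev, w_k, dt):
    """One dead-reckoning update (Eqs. 1-2): trapezoidal integration of the
    forward acceleration and Z-axis angular rate over one sensor period dt."""
    v_prev, theta_prev, x_prev, y_prev = state
    v_k = v_prev + 0.5 * (a_prev + a_k) * dt          # velocity update
    theta_k = theta_prev + 0.5 * (w_prev + w_k) * dt  # heading update
    d_k = 0.5 * (v_prev + v_k) * dt                   # distance this period (Eq. 2)
    x_k = x_prev + d_k * math.cos(theta_prev)         # position update uses the
    y_k = y_prev + d_k * math.sin(theta_prev)         # previous heading, as in Eq. 1
    return (v_k, theta_k, x_k, y_k)
```

Iterating this step at the 5 ms sensor period reproduces the short-distance tracking behaviour described above; the cumulative error discussed in Sect. 1 arises because each step's error feeds into the next.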

2.2 Architecture Overview

The overall architecture of the mobile robot positioning algorithm presented in this paper is shown in Fig. 2. The sensor data and the indoor floor plan are the inputs of the positioning algorithm. The sensor data are collected by the inertial sensors of the mobile robot, and the indoor floor plan is obtained by manual input or by an indoor electronic map construction algorithm such as SLAM. The indoor floor plan abstraction module translates the floor plan into a directed graph, while the dead reckoning module and the motion posture detection module use the sensor data to give the relative displacement and the position-related postures of the mobile robot, respectively. The goal of the map matching module is to match the motion trajectory of the mobile robot to a sequence of nodes in the directed graph and then estimate the real-time position of the robot. First, candidate road segments are selected according to the heading direction estimate and the connectivity of the road segments. Second, the algorithm updates the related parameters of the hidden Markov model according to the latest candidate road segments. Finally, it estimates the probabilities of all candidate roads through the Viterbi decoder; once the algorithm has reached its convergence stage, the most probable candidate road is the optimal estimate. The final output of the algorithm is the real-time position and heading direction of the mobile robot.

[Fig. 2 shows the processing pipeline: motion sensor and compass data feed the dead reckoning and motion posture detection modules; the indoor floor plan is abstracted into a directed graph; candidate path selection, HMM-based map matching and a Viterbi decoder then produce the positioning results (timestamp, position and heading) of the mobile robot.]

Fig. 2. System architecture of mobile robot positioning algorithm.

2.3 Indoor Floor Plan Abstraction

The posture-related positions divide the indoor roads into road segments. Taking each road segment as a node and the posture change pattern from one road segment to another as a directed edge, the indoor floor plan can be abstracted as a directed graph. Figure 3 shows an example of an indoor floor plan and its corresponding directed graph. In this paper, a node is represented by the tuple (id, x₁, y₁, x₂, y₂, φ₁, φ₂), where xᵢ, yᵢ, i = 1, 2 are the coordinates of the two endpoints of the road segment and φ₁, φ₂ are the heading directions of the mobile robot when it moves along the road segment and reaches the corresponding endpoint. A tuple (id₁, id₂, x, y, change of motion attitude (MA)) represents a directed edge between nodes, where id₁ is the identity of the starting node, id₂ is the identity of the end node, and x, y, MA are the coordinates of the posture-related position and the change of motion attitude from the starting node to the end node, respectively.
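The two tuples above map naturally onto simple record types. The following is a minimal sketch (field names are ours, chosen to mirror the paper's notation):

```python
from dataclasses import dataclass

@dataclass
class RoadSegment:
    """A node of the directed graph: one straight road segment."""
    id: int
    x1: float; y1: float      # first endpoint
    x2: float; y2: float      # second endpoint
    phi1: float; phi2: float  # heading (degrees) when reaching each endpoint

    @property
    def dist(self) -> float:
        """Segment length, later used by the observation model of Sect. 3.3."""
        return ((self.x2 - self.x1) ** 2 + (self.y2 - self.y1) ** 2) ** 0.5

@dataclass
class PostureEdge:
    """A directed edge: the posture-related position between two segments."""
    id1: int                  # identity of the starting node
    id2: int                  # identity of the end node
    x: float; y: float        # coordinates of the posture-related position
    ma: str                   # change of motion attitude (MA), e.g. "LEFT_TURN"
```

A floor plan is then just a list of `RoadSegment` nodes plus a list of `PostureEdge` edges connecting them.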

Fig. 3. Indoor floor plan example and its corresponding directed graph.

2.4 Motion Posture Detection

This section presents a decision tree model for the motion posture detection of indoor mobile robots. This paper focuses on the positioning of mobile robots in the indoor 2D plane, so our decision tree considers only the detection of postures relevant to planar motion: stationary, going straight, left/right turns and U-turns. The horizontal movement posture of the indoor autonomous mobile robot is distinguished by the different patterns of the horizontal component of the accelerometer and the vertical component of the gyroscope, where the horizontal component of the accelerometer is the Y-axis data in the local coordinate system of the mobile robot and the vertical component of the gyroscope is the Z-axis data in the same coordinate system. Furthermore, by also extracting the vertical component of the accelerometer, the method can easily be extended to three-dimensional indoor positioning scenes.

Figure 4 shows the decision tree for the motion posture detection of an indoor autonomous mobile robot. The decision tree uses the signal characteristics of the built-in accelerometer and gyroscope to identify the different motion posture patterns of the mobile robot. Because the linear velocity of the mobile robot is obtained by integrating the horizontal acceleration component over time, the instantaneous velocity carries an accumulated error at any given moment. Therefore, the top level of the decision tree uses the variance of the horizontal acceleration component to separate the stationary state from going straight; the second level uses the rotation rate measured on the Z-axis of the gyroscope to separate turns from going straight; finally, the third level uses the rotation angle to separate U-turns from left or right turns.

Fig. 4. Decision tree for motion posture detection.
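The three-level decision tree of Fig. 4 can be sketched as a cascade of threshold tests. The thresholds below are illustrative assumptions; the paper does not report its tuned values:

```python
def classify_posture(acc_var, rot_rate, rot_angle,
                     var_th=0.05, rate_th=10.0, angle_th=120.0):
    """Three-level decision tree of Fig. 4 (thresholds are our assumptions)."""
    if acc_var < var_th:           # level 1: variance of horizontal acceleration
        return "stationary"
    if abs(rot_rate) < rate_th:    # level 2: Z-axis gyroscope rotation rate (deg/s)
        return "go_straight"
    if abs(rot_angle) < angle_th:  # level 3: accumulated rotation angle (deg)
        return "left_right_turn"
    return "u_turn"
```

In practice the inputs would be computed over a short sliding window of sensor samples before being fed to the tree.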

3 HMM Based Map Matching Algorithm

With the help of the motion posture detection based on inertial sensor data, the motion trajectory of the mobile robot can be divided into sub-trajectory segments by position-related postures, such as left or right turns and U-turns, and these sub-trajectory segments form a sub-trajectory sequence in the time dimension. This section gives a detailed description of the Hidden Markov Model used to match the sub-trajectory sequence to the indoor abstract graph.

3.1 Hidden Markov Model

A Hidden Markov Model is a time-series probability model that describes the state of a process using discrete random variables. A basic HMM can be represented as λ = (S, V, A, B, π), where:

(1) S = {s₁, s₂, s₃, ..., s_N} is the set of possible hidden states and N = |S|. In our case, each state represents an indoor road segment, that is, a node of the directed graph. Therefore, a state s is represented by a tuple of the form (id, x₁, y₁, φ₁, x₂, y₂, φ₂), where id is the identification of the road segment and x₁, y₁, φ₁, x₂, y₂, φ₂ are the attributes of the corresponding node of the directed graph. It should be noted that if the mobile robot can reach another road segment by going straight from one road segment, the two road segments can be merged into a new road segment, i.e., a new hidden state; for example, the road segments s₄ and s₆ in Fig. 5 can be combined into a new road segment.


(2) V = {v₁, v₂, v₃, ..., v_M} is the set of observations of the model and M = |V|. In our case, an observable state represents the relative movement distance and heading direction measured by the motion sensors installed in the mobile robot, represented as (dist, φ).

(3) A = {a_ij} is the state transition probability distribution, where a_ij = p{q_{t+1} = s_j | q_t = s_i}, i, j ≤ N, and q_t denotes the state at time t. In other words, a_ij indicates the probability of moving from one road segment to an adjacent road segment.

(4) B = {b_i(k)} is the observation probability distribution in state i, where b_i(k) = p{z_t = v_k | q_t = s_i}, 1 ≤ i ≤ N, 1 ≤ k ≤ M, and z_t, q_t are the observation and state at time t, respectively. In other words, b_i(k) indicates the probability of a certain distance and heading direction being measured by the inertial sensors after the mobile robot has passed a road segment.

(5) π = {π_i} is the initial state distribution, where π_i = p{q₁ = s_i}.

3.2 Transition Probability Distribution (A)

The transition probability distribution gives the probability of moving from one hidden state to the next, which here means the probability of moving from one road segment to an adjacent road segment. Adjacent road segments are divided by posture-related positions, and each posture-related position has a corresponding motion posture. The better the robot's detected motion posture matches the position-related posture, the greater the probability that the robot has moved from one road segment to the other through this posture-related position, and vice versa. Therefore, we use the degree of matching between the detected motion posture of the mobile robot and the position-related posture to represent the transition probability between adjacent road segments. Let e_ij denote the edge of the directed graph from s_i to s_j; the corresponding position-related posture is e_ij.MA, following the definition in Sect. 2.3. Given the motion posture Rob_MA(t) of the mobile robot at time t, the probability of moving from s_i to s_j is given by Eq. 3, where p(Rob_MA(t) | e_ij.MA) can be obtained from the motion posture confusion matrix of Sect. 2.4:

$$p(s_{j,t}\,|\,s_{i,t-1}) = p(s_j\,|\,s_i, Rob_{MA}(t)) = p(Rob_{MA}(t)\,|\,e_{ij}.MA) \qquad (3)$$

3.3 Observation Probability Distribution (B)

In this paper, an observable state consists of the relative displacement of the mobile robot and its heading direction, and the two are independent of each other. Therefore, the observation probability distribution can be defined as:

$$P(v_{k,t}\,|\,s_{j,t}) = P(\varphi(t)\,|\,s_{j,t}) \cdot P(dist(t)\,|\,s_{j,t}) \qquad (4)$$

where P(φ(t) | s_{j,t}) is the observation probability of the mobile robot's heading direction at time t, and P(dist(t) | s_{j,t}) is the observation probability determined by the relative displacement of the mobile robot.


The better the heading direction of the mobile robot matches a road segment, the more likely the robot is located on that segment. In the indoor environment, the heading direction error comes not only from error accumulation but also from interference by the various metal materials in the building. In general, the heading error is relatively large and difficult to model accurately. Therefore, this paper uses Eq. 5 to model the heading direction in the observable state:

$$P(\varphi(t)\,|\,s_{j,t}) = P\{\varphi(t)\,|\,s_j.\varphi_i,\ i=1,2\} = \begin{cases} 1, & \text{if } |\varphi(t)-s_j.\varphi_i| < H_{TH},\ i=1,2 \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where H_TH is a constant threshold used to decide whether the heading direction of the mobile robot matches the direction of the road segment. To avoid excluding the correct road segment because of a large heading error, H_TH is set to 59° in this paper.

The relative displacement error of the mobile robot mainly comes from the accumulated error caused by the acceleration error in the dead reckoning process, and we assume that the relative displacement obeys a Gaussian distribution. On the one hand, intuitively, the closer the relative displacement of the robot is to the length of a road segment, the more likely the robot is located on that segment; on the other hand, all road segments whose lengths are much larger than the relative displacement should be equally likely. Combining these two situations, this paper uses Eq. 6 to model the relative displacement in the observable state:

$$P(dist(t)\,|\,s_{j,t}) = P\{dist(t)\,|\,s_j.dist\} = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_d}\,e^{-4.5}, & dist(t) + 3\sigma_d \le s_j.dist \\[2ex] \dfrac{1}{\sqrt{2\pi}\,\sigma_d}\,e^{-\frac{(dist(t)-s_j.dist)^2}{2\sigma_d^2}}, & \text{otherwise} \end{cases} \qquad (6)$$

where s_j.dist is the length of the road segment, which can be derived from the two endpoints of s_j, and σ_d is the standard deviation of the relative displacement of the mobile robot at time t. To estimate σ_d, this paper first measures the fluctuation Δa of the accelerometer readings while the mobile robot is stationary and estimates the standard deviation of the acceleration σ_a from the median absolute deviation (MAD) of the test data [18]; from the dead reckoning principle described in Sect. 2.1, it can be inferred that σ_d has a second-order (quadratic) relationship with σ_a:

$$\sigma_a = 1.4826 \times \mathrm{median}(|\Delta a|) \qquad (7)$$
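The observation model of Eqs. 5–7 is straightforward to sketch in Python (function names are ours). Note that the constant e^{−4.5} in Eq. 6 is exactly the value of the Gaussian term at the 3σ boundary, so the two branches join continuously:

```python
import math
import statistics

def heading_prob(phi, seg_phis, h_th=59.0):
    """Eq. 5: binary match of the measured heading against a segment's two
    endpoint headings (h_th in degrees, 59 deg as in the paper)."""
    return 1.0 if any(abs(phi - p) < h_th for p in seg_phis) else 0.0

def distance_prob(dist, seg_len, sigma_d):
    """Eq. 6: Gaussian likelihood of the travelled distance, flattened to a
    constant for segments much longer than the measured displacement."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_d)
    if dist + 3.0 * sigma_d <= seg_len:
        return norm * math.exp(-4.5)  # Gaussian value at the 3-sigma boundary
    return norm * math.exp(-((dist - seg_len) ** 2) / (2.0 * sigma_d ** 2))

def accel_sigma_mad(delta_a):
    """Eq. 7: robust estimate of accelerometer noise; 1.4826 scales the
    median absolute deviation to a Gaussian standard deviation."""
    return 1.4826 * statistics.median(abs(x) for x in delta_a)
```

The product `heading_prob(...) * distance_prob(...)` then gives the observation probability of Eq. 4 for a candidate segment.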

3.4 Initial State Distribution

If the starting point of the mobile robot is already known, then the road segment containing the starting point is the initial state and its probability is set to 1. If the starting point is unknown, all candidate road segments can be selected by Eq. 5 based on the initial heading direction of the mobile robot, and the initial probability distribution is uniform over these candidate segments.

3.5 Optimal Motion Trajectory Estimation

Based on the above-defined Hidden Markov Model, this paper uses the Viterbi algorithm to determine the optimal estimate of the motion trajectory of the mobile robot. For a given observable state sequence (z₁, z₂, ..., z_k), the goal of the Viterbi algorithm is to find the most likely hidden state sequence (q₁, q₂, ..., q_k). Figure 5 briefly illustrates the decoding process.

Fig. 5. Illustration of the proposed HMM model Viterbi decoding.

The Viterbi decoder is implemented by dynamic programming. First, a Viterbi variable is defined to represent the maximum probability that the Hidden Markov Model reaches state s_i along any single path at time t:

$$\delta_t(i) = \max_{q_1,\ldots,q_{t-1}} P\{q_1, q_2, \cdots, q_t = s_i, z_1, z_2, \cdots, z_t\,|\,\lambda\} \qquad (8)$$

At time t+1, the maximum probability of reaching hidden state s_j can be derived recursively from the Viterbi variables at time t:

$$\delta_{t+1}(j) = \Big[\max_i\big(\delta_t(i) \cdot P\{q_{t+1} = s_j\,|\,q_t = s_i\}\big)\Big] \cdot P\{z_{t+1}\,|\,q_{t+1} = s_j\}, \quad 1 \le t < k \qquad (9)$$

By recording backward pointers at each step, the most likely hidden state sequence at time k, i.e., the optimal estimate of the motion trajectory of the mobile robot, can be recovered by path backtracking.
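The recursion of Eqs. 8–9 plus backtracking can be sketched as follows (the interface is ours: `init[s]` is the initial probability, `trans(i, j)` the transition probability of Eq. 3, and `obs(s, z)` the observation likelihood of Eq. 4):

```python
def viterbi(states, init, trans, obs, observations):
    """Sketch of the Viterbi decoder of Sect. 3.5 (Eqs. 8-9)."""
    # delta[s]: max probability of any state path ending in s (Eq. 8)
    delta = {s: init[s] * obs(s, observations[0]) for s in states}
    back = []  # backward pointers for path backtracking
    for z in observations[1:]:
        new_delta, ptr = {}, {}
        for j in states:
            # Eq. 9: best predecessor times transition, times observation
            best_i = max(states, key=lambda i: delta[i] * trans(i, j))
            new_delta[j] = delta[best_i] * trans(best_i, j) * obs(j, z)
            ptr[j] = best_i
        delta = new_delta
        back.append(ptr)
    path = [max(states, key=lambda s: delta[s])]  # most likely final state
    for ptr in reversed(back):                    # follow backward pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In the map matching setting the states are the (possibly merged) road segments, and each observation is one sub-trajectory's (dist, φ) pair.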

4 Evaluation

We use the wheeled mobile robot described in Sect. 2 for the experimental analysis. The experimental environment is the fifth floor of a teaching hall on our campus. The experimental area is divided into east and west parts: the east part is approximately 84.85 m × 66.8 m, the west part is approximately 68.7 m × 106.75 m, and the connecting corridor between the two parts is 46.25 m long and 2.4 m wide. The overall layout is shown in Fig. 6. To record the true position of the robot during movement, we divide the experimental area into 0.8 m × 0.8 m squares and mark each small area; during the experiments, a second mobile robot with a camera moves in parallel with the robot to record its real-time positions. In the experimental area, the robot moves along the two planned trajectories denoted T1 and T2 in Fig. 6. T1 is 181.9 m long and includes 3 posture-related positions, so the whole trajectory is divided into 4 sub-trajectories; T2 is 180.5 m long and includes 2 posture-related positions, so the whole trajectory is divided into 3 sub-trajectories. The mobile robot repeats each trajectory 9 times.

Fig. 6. Indoor floor plan of experimental environment and mobile robot trajectories.

4.1 Influence of Heading Direction Errors

When the starting point is unknown, the convergence performance of the algorithm is closely related to the detection results of position-related postures. In general, the map matching algorithm based on posture detection is likely to converge after at least two consecutive position-related postures have been correctly detected. For position-related posture detection in a 2D floor plan, such as left and right turns, the heading direction error has a large impact on the detection results. Therefore, this section first analyzes the influence of the heading direction error on the convergence performance of the map matching algorithm. We define the precision of position-related posture detection by Eq. 10:

$$Precision = \frac{\text{Number of Correctly Detected Consecutive Two Postures}}{\text{Total Number of Consecutive Two Postures}} \qquad (10)$$

Suppose that the heading direction estimation error follows a zero-mean Gaussian distribution. Based on the raw heading data, Gaussian random noise is added to simulate different degrees of error. Figure 7 shows the precision of position-related posture detection under different values of the standard deviation of the heading direction error. As the figure shows, the precision is stable under moderate heading errors, but when the standard deviation of the heading error exceeds a certain level (40° for T1 and 30° for T2), the precision drops rapidly.

Fig. 7. Influence of heading errors on the precision of mobile robot posture detection.

4.2 Convergence Speed Analysis Without Knowing the Starting Point

If the starting point is unknown, the algorithm can still converge after the mobile robot has moved a certain distance and can then estimate the robot's real-time position. The distance traveled before convergence represents the convergence performance of the positioning algorithm. To evaluate the convergence performance of the proposed map matching algorithm, we compare it with the semMatch algorithm proposed in [18], which is similar in spirit: semMatch also uses a hidden Markov model to implement map matching, although the details of its HMM model are slightly different.

Motion Trajectory Sequence-Based Map Matching Assisted

339

Fig. 8. Distance traveled before convergence for each trajectory.

Using the same decision tree model to detect the position-related postures in the trajectory of the mobile robot, Fig. 8 shows the convergence performance of the two algorithms. On the one hand, for T1, both algorithms reach the convergence state after passing through two posture-related positions. However, the algorithm proposed in this paper needs to observe the subsequent road segment after detecting the corresponding posture, so its convergence performance is slightly worse: the mobile robot moves 1.9 m more before it reaches the convergence state. On the other hand, for T2, the algorithm proposed in this paper reaches the convergence state shortly after the correct detection of the first position-related posture, but semMatch does not converge due to the symmetry of the indoor road network, denoted by ∞ in Fig. 8. The main reason is the difference in the HMM model definitions of the two map matching algorithms. The hidden state of the map matching algorithm proposed in this paper is the straight road segment in the indoor road network. For T2, the mobile robot first passes a sufficiently long road segment, and the proposed algorithm combines this observable state with the subsequent detection of the first position-related posture to achieve convergence.

4.3 Online Positioning Performance with Knowing the Starting Point

If the starting point is already known, the proposed algorithm does not need to pass through the motion trajectory matching stage to converge. After convergence, the algorithm can track the moving trajectory of the mobile robot in real time. We use the Euclidean distance between the real position of the mobile robot and the position estimate given by the algorithm to analyze the real-time positioning performance. Figure 9 shows the variation of the positioning error of the mobile robot with increasing distance on both the T1 and T2 trajectories. When the motion trajectory does not include posture-related positions, the map matching assisted positioning technique proposed in this paper is equivalent to traditional dead reckoning, but after detecting the posture-related positions, the known coordinates of the posture-related positions can be used to calibrate the real-time position estimate of the robot. The real-time


Fig. 9. Online positioning errors for each trajectory.

positioning results of the mobile robot on T1 and T2 shown in Fig. 9 verify this trend. For T1, the average positioning error decreases from 4.0 m to 2.49 m, while for T2, it decreases from 6.58 m to 3.39 m. From the experimental results, it can be deduced that in a real environment, as the density of posture-related positions increases, the improvement in the positioning performance of the proposed algorithm becomes more pronounced.
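The calibration step described above can be sketched as follows. This is an assumption-laden illustration: `dead_reckon` and `calibrate` are our own minimal stand-ins, and the landmark coordinates are invented.

```python
import math

def dead_reckon(pose, step_len, heading_deg):
    """One dead-reckoning update: advance (x, y) by step_len metres
    along the current heading (degrees, 0 = +x axis)."""
    x, y = pose
    rad = math.radians(heading_deg)
    return (x + step_len * math.cos(rad), y + step_len * math.sin(rad))

def calibrate(pose, landmark_xy):
    """Reset the estimate to the known coordinates of a detected
    posture-related position (e.g. a corner of the floor plan)."""
    return landmark_xy

pose = (0.0, 0.0)
for _ in range(5):                      # five 1 m steps heading east
    pose = dead_reckon(pose, 1.0, 0.0)
pose = calibrate(pose, (5.2, 0.1))      # hypothetical known corner coordinates
```

Between landmarks the error grows with distance, exactly as pure dead reckoning; each detected posture-related position snaps the estimate back, which is why denser landmarks improve the average error.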

5 Conclusion

To solve the difficult problem of positioning autonomous mobile robots in dark, complex building corridors, subway tunnels, or underground mines after a sudden accident such as a fire, this paper proposes an indoor autonomous mobile robot tracking and positioning algorithm based on a novel hidden Markov model. In a structured indoor environment, this method uses the detection of position-related postures to match the motion trajectory of the mobile robot to an abstraction of the indoor floor plan. Compared with traditional dead reckoning, the proposed algorithm can significantly reduce the influence of cumulative errors on the positioning accuracy, and it is robust to heading direction and acceleration noise within a certain error range. The algorithm does not rely on cameras, and uses only the motion sensors installed in autonomous mobile robots and a known indoor floor plan to achieve fusion positioning of dead reckoning and map matching, even when the starting point is unknown. The algorithm has the advantages of simple deployment, low manufacturing cost, and easy operation.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (No. 61702288), the Natural Science Foundation of Tianjin in China (No. 16JCQNJC00700), and the Fundamental Research Funds for the Central Universities.


References
1. Garcia, E., Jimenez, M.A., De Santos, P.G., Armada, M.: The evolution of robotics research. Robot. Autom. Mag. IEEE 14(1), 90–103 (2007)
2. Wu, J., Li, T.M., Tang, X.Q.: Robust trajectory tracking control of a planar parallel mechanism. J. Tsinghua Univ. 5, 642–646 (2005)
3. Wu, J., Wang, D., Wang, L.: A control strategy of a two degrees-of-freedom heavy duty parallel manipulator. J. Dyn. Syst. Meas. Contr. 137(6), 061007 (2015)
4. Yang, J., Yang, J., Cai, Z.: An efficient approach to pose tracking based on odometric error modelling for mobile robots. Robotica 33(6), 1231–1249 (2015)
5. Yuan, X., Wang, D., Yan, Y.: Self-positioning of robot based on dead reckoning and ultrasonic data fusion. J. Naval Univ. Eng. 21(5), 67–72 (2009). (in Chinese)
6. Yu, N., Wang, S., Xu, C.: RGB-D based autonomous exploration and mapping of a mobile robot in unknown indoor environment. Robot 39(6), 860–871 (2017). (in Chinese)
7. Bachrach, A., De Winter, A., He, R., Hemann, G.: Range - robust autonomous navigation in GPS-denied environments. In: IEEE International Conference on Robotics and Automation, pp. 1096–1097. IEEE (2011)
8. Bao, H., Wong, W.C.: An indoor dead-reckoning algorithm with map matching. In: 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC), pp. 1534–1539. IEEE (2013)
9. Tang, H., Chen, W., Wang, J.: Artificial landmark distribution based on multi-ary m-sequence. Robot 36(1), 29–35 (2014). (in Chinese)
10. Lu, Y., Song, D.: Visual navigation using heterogeneous landmarks and unsupervised geometric constraints. IEEE Trans. Robot. 31(3), 736–749 (2015)
11. Gao, X., Zhang, T.: Unsupervised learning to detect loops using deep neural networks for visual SLAM system. Auton. Robots 41(1), 1–18 (2017)
12. Kim, J.H., Lee, J.C.: Dead-reckoning scheme for wheeled mobile robots moving on curved surfaces. J. Intell. Robot. Syst. 79(2), 211–220 (2015)
13. Grisetti, G., Stachniss, C., Burgard, W.: Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans. Robot. 23(1), 34–46 (2007)
14. Cheng, H., Chen, H., Liu, Y.: Topological indoor localization and navigation for autonomous mobile robot. IEEE Trans. Autom. Sci. Eng. 12(2), 729–738 (2015)
15. de la Puente, P., Rodríguez-Losada, D.: Feature based graph-SLAM in structured environments. Auton. Robots 37(3), 243–260 (2014)
16. Havangi, R., Taghirad, H.D., Nekoui, M.A., Teshnehlab, M.: A square root unscented FastSLAM with improved proposal distribution and resampling. IEEE Trans. Ind. Electron. 61(5), 2334–2345 (2014)
17. Richter, C., Vega-Brown, W., Roy, N.: Bayesian learning for safe high-speed navigation in unknown environments. In: Bicchi, A., Burgard, W. (eds.) Robotics Research. SPAR, vol. 3, pp. 325–341. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-60916-4_19
18. Aly, H., Youssef, M.: semMatch: road semantics-based accurate map matching for challenging positioning data. In: The 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, p. 5. ACM (2015)

Towards the Independent Spanning Trees in the Line Graphs of Interconnection Networks

Baolei Cheng1,2,3, Jianxi Fan1,2(B), Xiaoyan Li1, Guijuan Wang1, Jingya Zhou1, and Yuejuan Han1

1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
{chengbaolei,jxfan,jy zhou,hyj}@suda.edu.cn, {xyli,20164027004}@stu.suda.edu.cn
2 Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 21000, Jiangsu, China
3 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China

Abstract. Node/edge-independent spanning trees (ISTs) have attracted a lot of attention in the past twenty years. Many results, such as edge-disjoint Hamilton cycles, traceability, the number of spanning trees, structural properties, topological indices, etc., have been obtained on line graphs, and researchers have applied the line graphs of some interconnection networks to data center networks, such as SWCube, BCDC, etc. However, the node/edge conjectures are still open for n-connected interconnection networks with n ≥ 5. So far, results have been obtained on many special interconnection networks, but few are reported on their line graphs. In this paper, we consider the problem of constructing node-ISTs in the line graph G of an interconnection network G′. We first give the construction of node-ISTs in G based on the edge-ISTs in G′. Then, an algorithm to construct node-ISTs in G based on the edge-ISTs in G′ is presented. Finally, simulation experiments on the line graphs of hypercubes show that the maximal height of the constructed node-ISTs on the line graph of the n-dimensional hypercube is n + 1 for n ≥ 3.

Keywords: Independent spanning trees · Internally disjoint paths · Line graph · Interconnection network

1 Introduction

Node/edge-independent spanning trees (ISTs) can be used in reliable communication protocols [2,20], one-to-all broadcasting [29], multi-node broadcasting [4], reliable broadcasting, and secure message distribution [3]. Therefore, the problem of constructing multiple node/edge-ISTs for a given interconnection network is becoming an important issue.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 342–354, 2018. https://doi.org/10.1007/978-3-030-05057-3_27

ISTs in the Line Graphs of Interconnection Networks

343

We focus on the two well-known conjectures on the existence of ISTs in any interconnection network [20,33]:

Conjecture 1. Given an n-node-connected interconnection network G with n ≥ 1, there exist n node-ISTs rooted at an arbitrary node in G.

Conjecture 2. Given an n-edge-connected interconnection network G with n ≥ 1, there exist n edge-ISTs rooted at an arbitrary node in G.

Khuller and Schieber gave a proof that if any n-node-connected interconnection network has n node-ISTs, then any n-edge-connected interconnection network has n edge-ISTs [21]. However, Gopalan and Ramasubramanian found a counterexample that disproves Khuller and Schieber's result [12]. Thus, whether the node conjecture implies the edge conjecture, or vice versa, is still an open problem. For any interconnection network with n ≤ 4, Conjectures 1 and 2 were solved in [9,10,13,16,20,33]. For n ≥ 5, Conjectures 1 and 2 have been solved for some restricted classes of networks, such as planar networks [17], product networks [26], hypercubes [30,32], locally twisted cubes [25], crossed cubes [5–7], Möbius cubes [8], even networks [22], odd networks [23], Gaussian networks [18], etc.

The line graph has received much attention from researchers in recent years. Results have been reported on edge-disjoint Hamilton cycles [24], traceability [28], the number of spanning trees [11], structural properties [14], topological indices [27], treewidth [15], clique-perfectness [1], etc. Line graphs have applications in some data center networks that deploy servers on the edges of the original interconnection networks, such as SWCube [19] and BCDC [31]. However, few results have been reported on independent spanning trees in line graphs.

In this paper, we first adopt the definition of the line graph G of an n-edge-connected interconnection network G′. We mainly obtained the following results:

1. If there are n edge-ISTs rooted at an arbitrary node in G′, then there are n node-ISTs rooted at an arbitrary node in G.
2. An algorithm to construct n node-ISTs rooted at an arbitrary node in G based on the n edge-ISTs rooted at an arbitrary node in G′ is presented.
3. Some simulation results on the line graphs of hypercubes, based on Java and the JUNG library, are shown.

Finally, we point out that the algorithm proposed in this paper can be used to construct node-independent spanning trees on the SWCube and BCDC data center networks.

2 Preliminaries

2.1 Graph Terminology and Notation

An interconnection network can be abstracted as a graph G(V (G), E(G)), where V (G) denotes the node set and E(G) denotes the edge set. In this paper, graphs

344

B. Cheng et al.

and networks are used interchangeably. We also use decimal numbers to denote the nodes in G. Two x, y-paths P and Q, starting at x and ending at y, are edge-disjoint if E(P) ∩ E(Q) = ∅. Two x, y-paths P and Q are internally node-disjoint if they are edge-disjoint and V(P) ∩ V(Q) = {x, y}. Two spanning trees T1 and T2, rooted at the same node u in G, are edge-independent if the u, v-path in T1 and the u, v-path in T2 are edge-disjoint for each v ∈ V(G)\{u}. Two spanning trees T1 and T2 rooted at u in network G are node-independent if the u, v-path in T1 and the u, v-path in T2 are internally node-disjoint for each v ∈ V(G)\{u}. Clearly, if two trees T1 and T2 are node-independent spanning trees, then they are also edge-independent spanning trees. We use path(u, v, T) to denote the u, v-path in a tree T rooted at node u. A set of spanning trees rooted at the same node in G are edge-independent (resp., node-independent) if they are pairwise edge-independent (resp., node-independent). We use node-ISTs (resp., edge-ISTs) for short to denote node-independent (resp., edge-independent) spanning trees.

2.2 A Class of Networks—Line Graphs

Given a network G′, its line graph G is a graph such that each vertex of G represents an edge of G′, and two vertices of G are adjacent if and only if their corresponding edges share a common endpoint (i.e., are incident) in G′. We now provide Transformation 1 to describe the construction of a line graph from an existing network.

Transformation 1. Given a network G′, we construct the line graph G by the following steps: (1) For every edge starting at node x and ending at node y in E(G′), add a node [x, y] to network G, which is referred to as an edge-node. (2) For every two adjacent edges (x, y) and (y, z) in G′, connect [x, y] with [y, z] in G.

Figure 1 shows a network G′ and its line graph G. Network G is derived from network G′, and the number of edges in G′ equals the number of nodes in G. Inspired by Conjectures 1 and 2, the following interesting problem naturally arises.

Problem 1. Given n edge-ISTs in an n-edge-connected network G′, can we construct n node-ISTs in the line graph of G′?

In the following section, we try to answer this question by providing a general algorithm for any n-edge-connected network and its line graph.
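Transformation 1 is short enough to state directly in code. The sketch below is a straightforward reading of the two steps, with our own function name:

```python
from itertools import combinations

def line_graph(edges):
    """Build L(G'): each edge of G' becomes an edge-node, and two
    edge-nodes are adjacent iff the original edges share an endpoint
    (Transformation 1)."""
    nodes = [tuple(sorted(e)) for e in edges]          # step (1): edge-nodes
    adj = []
    for e1, e2 in combinations(nodes, 2):
        if set(e1) & set(e2):                          # step (2): incident in G'
            adj.append((e1, e2))
    return nodes, adj

# A triangle has 3 pairwise-incident edges, so its line graph is again
# a triangle: 3 edge-nodes, 3 adjacencies.
nodes, adj = line_graph([(0, 1), (1, 2), (0, 2)])
```

Note that, as stated in the text, the number of edge-nodes returned always equals the number of edges of the input network.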


Fig. 1. A network G′ and its line graph G.

3 Node-Independent Spanning Trees in Line Graphs

In this section, we first propose an algorithm, called NodeIST, to construct n node-ISTs in the line graph of G′ based on the n edge-ISTs in G′. Then, we prove that the n trees obtained by Algorithm NodeIST based on Transformation 1 are n node-ISTs.

3.1 Construction Algorithm of Node-Independent Spanning Trees for Line Graphs

We now present an algorithm, called NodeIST, to construct n node-ISTs T1, T2, ..., Tn rooted at node [u, v] in the line graph G of G′, based on the n edge-ISTs T1′, T2′, ..., Tn′ rooted at u in the n-edge-connected network G′ and an edge (u, v). Since (u, v) and (v, u) are the same edge in G′, we let [u, v] and [v, u] denote the same node in G. For simplicity, we always let an edge start at the smaller node and end at the bigger node in the examples shown in Fig. 2. In Algorithm NodeIST, Step 1 initializes the trees T1, T2, ..., Tn. In Step 2, the edge starting at the root node [u, v] in each tree is determined, together with the edges derived from T1′, T2′, ..., Tn′. After executing Step 3, each tree contains all the nodes in G.

Algorithm NodeIST
Input: n edge-independent spanning trees T1′, T2′, ..., Tn′ rooted at u in an n-edge-connected network G′; v, an arbitrary adjacent node of u in G′, where v > u;
Output: n node-independent spanning trees T1, T2, ..., Tn rooted at node [u, v] in the line graph of G′, denoted as G;
Begin
Step 1:
1: V(Ti) = V(G) and E(Ti) = ∅ for i = 1 to n.


Step 2:
2: for i = 1 to n do in parallel
3:   Suppose that there exists u(i) such that (u, u(i)) ∈ E(Ti′).
4:   if u(i) ≠ v
5:     E(Ti) = E(Ti) ∪ {([u, v], [u, u(i)])}.
6:   end if
7:   if any edge (x, y) is adjacent to another edge (w, z) in E(Ti′)
8:     E(Ti) = E(Ti) ∪ {([x, y], [w, z])}.
9:   end if
Step 3:
10: for any edge (x, y) ∈ E(G′)\(E(Ti′) ∪ {(u, v)}) with x < y do
11:   Suppose that there exists an edge (x, y(i)) ∈ E(Ti′).
12:   E(Ti) = E(Ti) ∪ {([x, y], [x, y(i)])}.
13: end for
End
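The pseudocode can be turned into a runnable sketch. This is our own illustrative reading, not the authors' implementation: we resolve line 7 by linking each tree edge-node to the edge-node of its parent edge (the adjacent tree edge closer to the root u), and we assume the root u has exactly one neighbour in each Ti′, which holds in the small cycle example used in the test below.

```python
from collections import deque

def sorted_edge(a, b):
    return tuple(sorted((a, b)))

def node_ist(trees, u, v, edges):
    """Hedged sketch of Algorithm NodeIST.
    trees: edge lists of the edge-ISTs T1'..Tn' rooted at u in G'.
    edges: all edges of G'.  Returns the edge sets of T1..Tn in L(G'),
    whose nodes are sorted edge tuples and whose root is [u, v]."""
    root = sorted_edge(u, v)
    out = []
    for T in trees:
        tset = {sorted_edge(a, b) for a, b in T}
        adj = {}
        for a, b in T:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        parent = {u: None}                   # BFS from u to orient Ti'
        dq = deque([u])
        while dq:
            x = dq.popleft()
            for y in adj[x]:
                if y not in parent:
                    parent[y] = x
                    dq.append(y)
        Ti = set()
        (ui,) = adj[u]                       # assumption: unique neighbour of u
        if ui != v:                          # Step 2, lines 4-6: edge at the root
            Ti.add((root, sorted_edge(u, ui)))
        for a, b in T:                       # Step 2, line 7: parent links
            child = b if parent[b] == a else a
            par = parent[child]
            if par != u:
                Ti.add((sorted_edge(par, child), sorted_edge(parent[par], par)))
        for a, b in edges:                   # Step 3: remaining edges of G'
            eab = sorted_edge(a, b)
            if eab in tset or eab == root:
                continue
            x = min(a, b)
            yi = min(adj[x])                 # some tree edge (x, y(i)) in Ti'
            Ti.add((eab, sorted_edge(x, yi)))
        out.append(Ti)
    return out
```

On the 4-cycle with edge-ISTs 0-1-2-3 and 0-3-2-1 rooted at 0, the sketch produces two node-ISTs rooted at edge-node [0, 1] of the line graph (which is again a 4-cycle).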

Example 1. Take the network G′ and its line graph G in Fig. 1 as an example. The three trees in Fig. 2(a) are not edge-ISTs in G′, because the 0, 7-path in the second tree and the 0, 7-path in the third tree share the common edge (2, 6). In Fig. 2(b), the three trees are edge-ISTs rooted at node 0 in G′, which are isomorphic to each other. Suppose that the three trees in Fig. 2(b), from left to right, are T1′, T2′, and T3′. We use the three trees and node 1 as the input of Algorithm NodeIST. After the first step, we obtain the trees T1, T2, and T3 shown in Fig. 2(c), whose edge sets are empty and whose node sets contain all the edge-nodes of G′; after the second step, the three trees are as shown in Fig. 2(d); lastly, the constructed node-ISTs are demonstrated in Fig. 2(e). Note that each node in Fig. 2(b) is denoted by one decimal value, while each node in Fig. 2(c), (d), and (e) is denoted by two decimal values. Now, T1, T2, and T3 are three node-ISTs rooted at [0, 1] in G.

3.2 Correctness of Node-Independent Spanning Trees Obtained by Algorithm NodeIST

By Algorithm NodeIST, every node of G is contained in Ti for i = 1, 2, ..., n. Thus, we have the following lemma.

Lemma 1. Ti obtained by Algorithm NodeIST is a spanning tree in G for any integer i with 1 ≤ i ≤ n.

Proof. By Algorithm NodeIST, Ti contains all the nodes in V(G), and it is easy to verify that Ti is a tree for any integer i with 1 ≤ i ≤ n. Thus, the proof is completed.


Fig. 2. (a) Wrong edge-ISTs. (b) Correct edge-ISTs. (c) Trees obtained by Step 1 of Algorithm NodeIST. (d) Trees obtained by Step 2 of Algorithm NodeIST. (e) NodeISTs.

Suppose that T is a tree rooted at node [u, v] and [x, y] is an arbitrary node in the set V(G)\{[u, v]}. We use path([u, v], [x, y], T) to denote the path starting at [u, v] and ending at [x, y] in T. Based on the definition of independent spanning trees, we present the following lemma to restate node-independence.

Lemma 2. Let Ti and Tj be two different spanning trees rooted at node [u, v] in G, where 1 ≤ i < j ≤ n. Ti and Tj are node-independent if and only if for every node [x, y] in G, [x, y] ≠ [u, v], V(path([u, v], [x, y], Ti)) ∩ V(path([u, v],


[x, y], Tj)) = {[u, v], [x, y]} and V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}.

Now we prove that the n trees obtained by Algorithm NodeIST are n node-independent spanning trees.

Theorem 1. T1, T2, ..., Tn obtained by Algorithm NodeIST are n node-independent spanning trees rooted at node [u, v] in G.

Proof. By Lemma 1, Tl obtained by Algorithm NodeIST is a spanning tree in G for any integer l with 1 ≤ l ≤ n. Let [u, v] be the root node of each tree. We only need to prove that for any vertex [x, y] ∈ V(G)\{[u, v]} with x < y, V(path([u, v], [x, y], Ti)) ∩ V(path([u, v], [x, y], Tj)) = {[u, v], [x, y]} and V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}. For any 1 ≤ i < j ≤ n and any edge (x, y) ∈ E(G′)\{(u, v)}, we have the following cases:

Case 1. (x, y) ∈ E(Ti′) and (x, y) ∈ E(Tj′). Then path(u, x, Ti′) and path(u, x, Tj′) are edge-disjoint (similarly, path(u, y, Ti′) and path(u, y, Tj′) are edge-disjoint) by the hypothesis of Algorithm NodeIST. Thus, we can verify that E(path(u, x, Ti′)) ∩ E(path(u, x, Tj′)) = ∅. By Algorithm NodeIST, all the edges in path(u, x, Ti′) and path(u, x, Tj′) are transformed into nodes, and every two connected edges are transformed into two adjacent nodes. Since E(path(u, x, Ti′)) ∩ E(path(u, x, Tj′)) = ∅, we have V(path([u, v], [x, y], Ti)) ∩ V(path([u, v], [x, y], Tj)) = {[u, v], [x, y]}. Let the node adjacent to node u in Ti′ be w and the node adjacent to node u in Tj′ be z. Since Ti′ and Tj′ are edge-independent, w ≠ z. We have the following subcases.

Case 1.1. w = v and z ≠ v. It is clear that {w, z, x, y, u, v} ⊃ {x, y, u, v}. By Algorithm NodeIST, [u, z] ∈ V(path([u, v], [x, y], Tj)), which implies that V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}.

Case 1.2. w ≠ v and z = v. The proof is similar to Case 1.1.

Case 1.3. w ≠ v and z ≠ v. It is clear that {w, z, x, y, u, v} ⊃ {x, y, u, v}. By Algorithm NodeIST, [u, w] ∈ V(path([u, v], [x, y], Ti)) and [u, z] ∈ V(path([u, v], [x, y], Tj)), which implies that V(path([u, v], [x, y], Ti)) ∪ V(path([u, v], [x, y], Tj)) ⊃ {[u, v], [x, y]}.

Case 2. (x, y) ∈ E(Ti′) and (x, y) ∉ E(Tj′). By Algorithm NodeIST, if the node adjacent to node u in Ti′ is v, we can verify that V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from the edges in path(u, x, Ti′) plus the set {[x, y]}. Otherwise, V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from the edges in path(u, x, Ti′) plus the set {[u, v], [x, y]}. The rest of the proof is similar to Case 1.

Case 3. (x, y) ∉ E(Ti′) and (x, y) ∈ E(Tj′). The proof is similar to Case 2.

Case 4. (x, y) ∉ E(Ti′) and (x, y) ∉ E(Tj′). By Algorithm NodeIST, if the node adjacent to node u in Ti′ is v, we can verify that V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from the edges in path(u, x, Ti′) plus the set {[x, y]}. Otherwise, V(path([u, v], [x, y], Ti)) equals the set of edge-nodes transformed from the edges in path(u, x, Ti′) plus the set {[u, v], [x, y]}.


If the node adjacent to node u in Tj′ is v, we can verify that V(path([u, v], [x, y], Tj)) equals the set of edge-nodes transformed from the edges in path(u, x, Tj′) plus the set {[x, y]}. Otherwise, V(path([u, v], [x, y], Tj)) equals the set of edge-nodes transformed from the edges in path(u, x, Tj′) plus the set {[u, v], [x, y]}. Since Ti′ and Tj′ are edge-independent, the nodes adjacent to node u in Ti′ and Tj′ are different. The rest of the proof is similar to Case 1.

By Lemma 2, Ti and Tj are independent. As a result, the theorem holds.

Given the n edge-independent spanning trees T1′, T2′, ..., Tn′ rooted at u in the n-edge-connected network G′ and an arbitrary adjacent node v of u in G′ with v > u, the n node-independent spanning trees T1, T2, ..., Tn rooted at node [u, v] in G are constructed in parallel; thus, we have the following theorem.

Theorem 2. The set of node-independent spanning trees T1, T2, ..., Tn obtained by Algorithm NodeIST can be obtained in O(N) time, where N is the number of nodes in G (or the number of edges in G′).

Based on the above discussion, we further present the following observations.

Observation 1. Algorithm NodeIST can be improved to obtain optimized node-ISTs. For example, in Fig. 2(e), if we let the node [1, 5] be adjacent to node [5, 7] in the third tree, then we can obtain another set of optimized node-ISTs with lower height.

Observation 2. Given n node-independent spanning trees in an n-node-connected network G′, we can also construct n node-independent spanning trees in the line graph of G′ based on Algorithm NodeIST.

Observation 3. It is also interesting to study another similar algorithm with the reverse direction based on Algorithm NodeIST.
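For small instances, the node-independence condition of Lemma 2 can be checked mechanically. The helpers below are our own illustration (trees are given as adjacency dictionaries), not part of the paper:

```python
def tree_path(tree_adj, root, v):
    """Unique root -> v path in a tree given as an adjacency dict."""
    stack, seen = [(root, [root])], {root}
    while stack:
        node, path = stack.pop()
        if node == v:
            return path
        for nb in tree_adj[node]:
            if nb not in seen:
                seen.add(nb)
                stack.append((nb, path + [nb]))
    return None

def node_independent(t1, t2, root):
    """True iff for every v != root the two root -> v paths share
    only root and v (the condition of Lemma 2)."""
    for v in t1:
        if v == root:
            continue
        p1, p2 = tree_path(t1, root, v), tree_path(t2, root, v)
        if set(p1) & set(p2) != {root, v}:
            return False
    return True
```

On a 4-cycle, the two spanning paths 0-1-2-3 and 0-3-2-1 rooted at 0 pass this check, while a tree compared with itself fails it, as expected.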

4 Simulation of Node-ISTs on the Line Graphs of Hypercubes

As well-known interconnection networks, hypercubes have received much attention from researchers. In this section, we simulate the construction of node-ISTs on hypercubes based on Java and the JUNG library. The n-dimensional hypercube Qn is a graph consisting of 2^n nodes and n2^(n-1) edges. Each node in Qn is represented by a binary string of length n, and two nodes in Qn are adjacent whenever their corresponding strings differ in exactly one place. For example, Fig. 3 shows the four node-ISTs rooted at 0 in Q4 constructed by the algorithm in [30], the maximal height of which is 5. Because JUNG does not admit duplicate nodes in one canvas, the prefixes A, B, C, D are used only to distinguish copies of the same node; for example, A0, B0, C0, D0 all denote node 0. Since the hypercube is node-symmetric, the line graph of the hypercube is also node-symmetric. If the nodes of the 4-dimensional hypercube are 0, 1, ..., 15, then there


Fig. 3. 4 edge-ISTs rooted at 0 on 4-dimensional hypercube.

are 32 edge-nodes in the line graph of the 4-dimensional hypercube. For simplicity, we use the numbers 1, 2, ..., 32 to denote the edge-nodes; the corresponding relation is shown in Table 1, which is used in the simulation program to show the node-ISTs. Similarly, the prefixes a, b, c, d are used only by the program to distinguish copies of the same node; for example, a1, b1, c1, d1 all denote node 1.

Table 1. Corresponding relations between numbers and edge-nodes.

1→[0, 1]    2→[0, 2]    3→[0, 4]    4→[0, 8]
5→[2, 3]    6→[1, 3]    7→[1, 5]    8→[3, 11]
9→[1, 9]    10→[2, 6]   11→[3, 7]   12→[6, 7]
13→[4, 6]   14→[6, 14]  15→[7, 15]  16→[5, 7]
17→[4, 5]   18→[2, 10]  19→[4, 12]  20→[5, 13]
21→[8, 12]  22→[12, 14] 23→[12, 13] 24→[14, 15]
25→[13, 15] 26→[10, 14] 27→[11, 15] 28→[9, 13]
29→[8, 10]  30→[10, 11] 31→[9, 11]  32→[8, 9]

The node-ISTs rooted at 1 (corresponding to the edge-node [0, 1]) in the line graph of the 4-dimensional hypercube, constructed by Algorithm NodeIST, are shown in Fig. 4; the heights of the four trees are 4, 5, 5, and 5, respectively. Taking the internally node-disjoint paths between 1 and 25 as an example, the four paths are as follows:

Fig. 4. The node-ISTs on the line graph of 4-dimensional hypercube.

1→7→20→25
1→2→10→14→24→25
1→3→19→23→25
1→4→32→28→25

The paths denoted in edge-nodes are as follows:

[0, 1]→[1, 5]→[5, 13]→[13, 15]
[0, 1]→[0, 2]→[2, 6]→[6, 14]→[14, 15]→[13, 15]
[0, 1]→[0, 4]→[4, 12]→[12, 13]→[13, 15]
[0, 1]→[0, 8]→[8, 9]→[9, 13]→[13, 15]

It is easy to verify that the paths between the edge-node [0, 1] and any other edge-node are also internally node-disjoint. The radial mode of the node-ISTs rooted at 1 is shown in Fig. 5. Here, the numbers of nodes deployed in the layers from the inside to the outside are 4, 9, 24, 39, 40, and 12, respectively. Simulation results show that the maximal height of the node-ISTs rooted at any node in the line graph of the n-dimensional hypercube is n + 1. We have the following observation.

Observation 4. The heights of the ISTs T1 and Ti in the line graph of the n-dimensional hypercube are n and n + 1 for i = 2, 3, ..., n, respectively, where n ≥ 3.
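The "easy to verify" claim can be checked directly on the four listed paths: every pair must share only the two endpoints. A small checker (our own helper, using the numbered paths from above):

```python
# The four 1 -> 25 paths listed above, in edge-node numbering.
paths = [
    [1, 7, 20, 25],
    [1, 2, 10, 14, 24, 25],
    [1, 3, 19, 23, 25],
    [1, 4, 32, 28, 25],
]

def internally_disjoint(paths):
    """True iff every pair of paths shares only the common endpoints."""
    ends = {paths[0][0], paths[0][-1]}
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if set(paths[i]) & set(paths[j]) != ends:
                return False
    return True
```

Running the checker on these four paths confirms that their pairwise intersections are exactly {1, 25}, i.e. they are internally node-disjoint.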


Fig. 5. The radial mode of node-ISTs.

Observing that the heights of the n optimal node-ISTs rooted at any node in the n-dimensional hypercube Qn are all n + 1 [30], and that L(Qn) contains more nodes than Qn for n ≥ 3, the node-ISTs rooted at any node in L(Qn) have an advantage in height with respect to the number of nodes. If we abstract the interconnection networks of servers in SWCube and BCDC, we obtain the line graphs of the generalized hypercube and the crossed cube, respectively. Thus, we only need to construct independent spanning trees in those two networks. Taking the sets of independent spanning trees from [26] and [31] as input, we can use Algorithm NodeIST to construct independent spanning trees in the line graphs of the generalized hypercube and the crossed cube, respectively.

5 Conclusions

In this paper, we have proved that if there are n edge-independent spanning trees rooted at an arbitrary node in an n-edge-connected network G′, then there are n node-independent spanning trees rooted at an arbitrary node in the line graph of G′. An algorithm to construct node-ISTs in G based on the edge-ISTs in G′ is also presented. Some simulations of independent spanning trees on


the line graphs of hypercubes were presented, and we pointed out that the algorithm proposed in this paper can be used to construct independent spanning trees on the SWCube and BCDC data center networks. It remains interesting to prove whether the node conjecture implies the edge conjecture, or vice versa.

Acknowledgment. This work is supported by the National Natural Science Foundation of China (No. 61572337, No. 61502328, and No. 61602333), the China Postdoctoral Science Foundation (No. 2015M581858), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 18KJA520009), the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1501089B and No. 1701173B), the Opening Foundation of the Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks (No. WSNLBKF201701), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX17 2005 and No. KYCX18 2510).

References
1. Bonomo, F., Durán, G., Safe, M.D., Wagler, A.K.: Clique-perfectness of complements of line graphs. Discret. Appl. Math. 186(1), 19–44 (2015)
2. Bao, F., Funyu, Y., Hamada, Y., Igarashi, Y.: Reliable broadcasting and secure distributing in channel networks. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E81-A, 796–806 (1998)
3. Bao, F., Igarashi, Y., Öhring, S.R.: Reliable broadcasting in product networks. Discret. Appl. Math. 83(1–3), 3–20 (1998)
4. Chen, Y.-S., Chiang, C.-Y., Chen, C.-Y.: Multi-node broadcasting in all-ported 3-D wormhole-routed torus using an aggregation-then-distribution strategy. J. Syst. Arch. 50(9), 575–589 (2004)
5. Cheng, B., Fan, J., Jia, X., Zhang, S.: Independent spanning trees in crossed cubes. Inf. Sci. 233(1), 276–289 (2013)
6. Cheng, B., Fan, J., Jia, X., Wang, J.: Dimension-adjacent trees and parallel construction of independent spanning trees on crossed cubes. J. Parallel Distrib. Comput. 73, 641–652 (2013)
7. Cheng, B., Fan, J., Lyu, Q., Zhou, J., Liu, Z.: Constructing independent spanning trees with height n on the n-dimensional crossed cube. Futur. Gener. Comput. Syst. 87, 404–415 (2018)
8. Cheng, B., Fan, J., Jia, X., Jia, J.: Parallel construction of independent spanning trees and an application in diagnosis on Möbius cubes. J. Supercomput. 65(3), 1279–1301 (2013)
9. Cheriyan, J., Maheshwari, S.N.: Finding nonseparating induced cycles and independent spanning trees in 3-connected graphs. J. Algorithms 9(4), 507–537 (1988)
10. Curran, S., Lee, O., Yu, X.: Finding four independent trees. SIAM J. Comput. 35(5), 1023–1058 (2006)
11. Dong, F., Yan, W.: Expression for the number of spanning trees of line graphs of arbitrary connected graphs. J. Graph Theory 85(1), 74–93 (2017)
12. Gopalan, A., Ramasubramanian, S.: A counterexample for the proof of implication conjecture on independent spanning trees. Inf. Process. Lett. 113(14–16), 522–526 (2013)
13. Gopalan, A., Ramasubramanian, S.: On constructing three edge independent spanning trees. SIAM J. Comput. (2011, submitted)


14. Hasunuma, T.: Structural properties of subdivided-line graphs. J. Discret. Algorithms 31, 69–86 (2015)
15. Harvey, D.J., Wood, D.R.: Treewidth of the line graph of a complete graph. J. Graph Theory 79(1), 48–54 (2015)
16. Hoyer, A., Thomas, R.: Four edge-independent spanning trees. SIAM J. Discret. Math. 32(1), 233–248 (2018)
17. Huck, A.: Independent trees in planar graphs. Graphs Comb. 15(1), 29–77 (1999)
18. Hussain, Z., AlBdaiwi, B., Cerny, A.: Node-independent spanning trees in Gaussian networks. J. Parallel Distrib. Comput. 109, 324–332 (2017)
19. Li, D., Wu, J.: On data center network architectures for interconnecting dual-port servers. IEEE Trans. Comput. 64(11), 3210–3222 (2015)
20. Itai, A., Rodeh, M.: The multi-tree approach to reliability in distributed networks. Inf. Comput. 79(1), 43–59 (1988)
21. Khuller, S., Schieber, B.: On independent spanning trees. Inf. Process. Lett. 42(6), 321–323 (1992)
22. Kim, J.-S., Lee, H.-O., Cheng, E., Lipták, L.: Independent spanning trees on even networks. Inf. Sci. 181(13), 2892–2905 (2011)
23. Kim, J.-S., Lee, H.-O., Cheng, E., Lipták, L.: Optimal independent spanning trees on odd graphs. J. Supercomput. 56(2), 212–225 (2011)
24. Li, H., He, W., Yang, W., Bai, Y.: A note on edge-disjoint Hamilton cycles in line graphs. Graphs Comb. 32, 741–744 (2016)
25. Liu, Y.-J., Chou, W.Y., Lan, J.K., Chen, C.: Constructing independent spanning trees for locally twisted cubes. Theor. Comput. Sci. 412(22), 2237–2252 (2011)
26. Obokata, K., Iwasaki, Y., Bao, F., Igarashi, Y.: Independent spanning trees of product graphs and their construction. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E79-A(11), 1894–1903 (1996)
27. Su, G., Xu, L.: Topological indices of the line graph of subdivision graphs and their Schur-bounds. Appl. Math. Comput. 253, 395–401 (2015)
28. Tian, T., Xiong, L.: Traceability on 2-connected line graphs. Appl. Math. Comput. 321, 1339–1351 (2018)
29. Tseng, Y.-C., Wang, S.-Y., Ho, C.-W.: Efficient broadcasting in wormhole-routed multicomputers: a network-partitioning approach. IEEE Trans. Parallel Distrib. Syst. 10(1), 44–61 (1999)
30. Tang, S.-M., Wang, Y.-L., Leu, Y.-H.: Optimal independent spanning trees on hypercubes. J. Inf. Sci. Eng. 20(1), 143–155 (2004)
31. Wang, X., Fan, J., Lin, C.-K., Zhou, J., Liu, Z.: BCDC: a high-performance, server-centric data center network. J. Comput. Sci. Technol. 33(2), 400–416 (2018)
32. Yang, J.-S., Tang, S.-M., Chang, J.-M., Wang, Y.-L.: Parallel construction of optimal independent spanning trees on hypercubes. Parallel Comput. 33(1), 73–79 (2007)
33. Zehavi, A., Itai, A.: Three tree-paths. J. Graph Theory 13(2), 175–188 (1989)

POEM: Pricing Longer for Edge Computing in the Device Cloud

Qiankun Yu, Jigang Wu(B), and Long Chen

Guangdong University of Technology, Guangzhou 510006, China
[email protected], [email protected], [email protected]

Abstract. Multiple-access mobile edge computing has been proposed as a promising technology to bring computation services close to end users, by making good use of edge cloud servers. In mobile device clouds (MDC), idle end devices may act as edge servers to offer computation services for busy end devices. Most existing auction-based incentive mechanisms in MDC focus on only one round of auction, without considering the time correlation. Moreover, although existing single-round auctions can also be run repeatedly, users then have to trade with higher bids to get more resources in the cascading rounds of auctions, so their budgets may run out too early for them to participate in later auctions, leading to auction failures and a loss of overall benefit. In this paper, we formulate the computation offloading problem as a social welfare optimization problem with given budgets of mobile devices, and take longer-term pricing of mobile devices into account. This problem is a multiple-choice multi-dimensional 0-1 knapsack problem, which is NP-hard. We propose an auction framework named MAFL for long-term benefit that runs a single-round resource auction in each round. Extensive simulation results show that the proposed auction mechanism outperforms the single-round auction by about 55.6% in revenue on average. Keywords: Edge computing · Computation offloading · Multiple rounds · Mobile device cloud · Long-term · Auction

1 Introduction

In the past few years, despite the increasing capabilities of mobile devices including smart phones, Internet of Things (IoT) devices, and wearable devices, the resource requirements of mobile applications can often transcend the computation power of a single device [1–4]. Therefore, mobile cloud computing has been proposed to offload tasks to a remote cloud for execution [5–9], though it may introduce longer delays and hurt user experience. Moreover, long-distance telecommunication consumes more energy. In recent work, multiple-access mobile edge computing has been proposed as a promising technology to bring computation services close to end users, by making good use of edge cloud servers. There are three types of architecture used in edge computing [10]: edge server,

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 355–369, 2018. https://doi.org/10.1007/978-3-030-05057-3_28

356

Q. Yu et al.

coordinator device, and device cloud. This paper uses the third architecture. Computation offloading [11,12] can be performed in Mobile Device Clouds (MDC) [13–16], which use the idle resources of nearby mobile devices to execute tasks. However, mobile devices that provide idle resources may also incur extra cost to themselves, which should be monetarily compensated. To encourage more devices to share their idle resources, several prior works have been done in MDC. Miluzzo et al. [17] proposed an incentive scheme in MDC; however, this scheme ignored the resource requirements of tasks. Song et al. [18] designed a non-competitive pricing mechanism with a bill backlog threshold: if a device exceeds the threshold, it can reduce its bill backlog by providing services for others; otherwise it will not be able to get the service. However, they do not consider whether a device has sufficient resources to provide services for others. Wang et al. [19] proposed a Stackelberg game approach for cooperative application execution in mobile cloud computing. However, they do not consider that mobile devices are heterogeneous: different mobile devices may have different processing power levels and energy consumption levels, so the payments for tasks should differ. In recent studies, auctions have been widely used as one of the most popular incentive schemes in many areas, such as virtual machine allocation [20,21] and wireless spectrum allocation [22,23]. The celebrated VCG mechanism [24] is a well-known type of auction; it is essentially the only type of auction that simultaneously guarantees both truthfulness and absolute economic efficiency. Li et al. [25] proposed an online spectrum auction framework. This mechanism can also be used for MDC resource allocation, but buyers' budget constraints are not considered. Jin et al. [26] designed an incentive-compatible auction mechanism for cloudlet resource sharing in mobile cloud computing.
However, this mechanism uses a one-to-one match and assumes that the resource requirements are homogeneous. In this work, we consider that a seller can serve multiple buyers and that the resource requirements of buyers are heterogeneous. Wang et al. [27] designed an efficient auction mechanism to solve the task assignment problem in MDC. However, that auction mechanism assumes that every buyer must be allocated resources. In this work, we consider that resources are limited, so it cannot be ensured that every buyer is allocated resources. In MDC, existing auction mechanisms focus only on a single-round auction [26,27]. In many cases, we need multiple rounds of auctions. Although existing single-round auctions can also be used multiple times, the users' budget constraints should then be considered. The budget is the total amount of money a buyer can pay, and it plays a key role in designing periodical auctions. Although long-term and budget constraints are considered in crowdsourcing [28], there tasks are homogeneous and a task can be allocated to multiple workers; moreover, the authors of [28] assumed unlimited resources at the workers. In a resource-limited MDC, idle mobile devices are unlikely to meet the needs of all users at the same time, so that scheme cannot be used directly in MDC, which motivates us to design a long-term, multi-round auction with budget constraints. To design effective schemes, the following challenges should be properly handled: (1) How to prevent a user's budget from running out prematurely in a multi-round


auction? (2) How to efficiently allocate resources for different bids of different devices? (3) How to attract more sellers to participate in MDC? To solve the above challenges, in this paper we consider Pricing lOnger for Edge coMputing in the device cloud (POEM). We aim to design a long-term auction of multiple rounds. The main features are as follows: (1) The mobile tasks are indivisible, and the resource requirements (CPU, memory, battery, etc.) of tasks differ. (2) The amount of resources requested by each user is not a fixed value in each round, and the amount of resources provided by nearby mobile devices is also not fixed in each round. (3) We penalize a winning user by reducing its bid according to its remaining budget in the next round of the auction. The main contributions of this paper are as follows:
• Considering the time correlation of resource allocation, we formulate the task offloading problem as an integer linear program, and we design an MDC Auction Framework for Long-term (MAFL). The next round of genuine bids is adjusted according to the results of the previous round.
• We design a Single Round Mobile Resources Auction (SRMRA) algorithm for comparison with MAFL, and we demonstrate the performance of the algorithm by proofs and extensive experiments.
• We conduct extensive simulation experiments to demonstrate the performance of our mechanism. MAFL is better than the single-round auction SRMRA: it outperforms SRMRA by about 12.2% in revenue when the number of users is 40 and 80 rounds of auction are performed, and by about 55.6% in revenue on average when the number of users varies from 10 to 80 with 80 rounds of auction.
The rest of the paper is organized as follows. Section 2 describes the system model and problem formulation. The auction mechanism for single-round MDC resource allocation is designed in Sect. 3.
Section 4 proposes an auction framework in the MDC for long-term optimization. Section 5 presents the simulation results. Finally, we conclude the paper in Sect. 6.

2 Problem Definition

2.1 System Model

We assume that the total time of the whole auction period is T (T is a long time) [21], and divide T into multiple time slots. One round of auction is performed in each time slot l ∈ \mathcal{L}, where \mathcal{L} = {1, 2, 3, ..., L} and L is the total number of auction rounds. There are U users in the MDC; each user u ∈ \mathcal{U} needs some resources (CPU, memory, battery, etc.) to perform its indivisible tasks, where \mathcal{U} = {1, 2, 3, ..., U}. There are M sellers in the MDC; each seller m ∈ \mathcal{M} can share its resources with others, where \mathcal{M} = {1, 2, 3, ..., M}. Let r_u^{(l)} be the amount of resources requested by user u, and R_m^{(l)} the amount of resources provided by seller m, in the l-th round. The U users are the bidders in the auctions; each user



submits its valuation V_u^{(l)} = {v_{u,1}^{(l)}, v_{u,2}^{(l)}, ..., v_{u,M}^{(l)}} in round l, where v_{u,m}^{(l)} ∈ V_u^{(l)} denotes the valuation of buyer u for seller m. Moreover, as each user is also budget constrained, we use B_u to denote user u's total budget over all rounds. Of course, sellers cannot provide unlimited resources, so we use W_m to represent the total amount of resources provided by seller m over all rounds. Specifically, the resource allocation is determined by Y_u^{(l)} = {y_{u,1}^{(l)}, y_{u,2}^{(l)}, ..., y_{u,M}^{(l)}}, where y_{u,m}^{(l)} ∈ {0, 1} is a binary indicator whose value is 1 if user u's task is performed on seller m in round l and 0 otherwise. We list the basic notations used in this paper in Table 1.

Table 1. Basic notations

Notation                     Description
T                            The total time of the whole auction period
L                            The total number of auction rounds
U, M                         The total numbers of users and sellers
\mathcal{L}                  The set {1, 2, 3, ..., L}
\mathcal{U}, \mathcal{M}     The sets of users and sellers
r_u^{(l)}                    The amount of resources requested by user u in the l-th round
R_m^{(l)}                    The amount of resources provided by seller m in the l-th round
V_u^{(l)}                    The set of u's valuations in the l-th round
Y_u^{(l)}                    The set of u's indicators in the l-th round
v_{u,m}^{(l)}                User u's valuation for seller m in the l-th round
y_{u,m}^{(l)}                Whether user u wins the resources provided by seller m in the l-th round or not
B_u                          User u's total budget
W_m                          The total amount of resources provided by seller m

2.2 Problem Formulation

Problem Formulation

The objective of the MDC (mobile device clouds) resource allocation problem is to maximize the user’s bids. In the whole auction period, the higher the total price of the user’s bids, the more compensation of the device that provided, so the more people will be attracted to share the idle resources in his device. We formalize our objective as follows: (l) (l) OPT-1 obj : max vu,m yu,m (1) l∈L u∈U m∈M

subject to: m∈M

(l) yu,m ≤ 1 ∀u ∈ U ∀l ∈ L

(1-1)

POEM: Pricing Longer for Edge Computing in the Device Cloud

(l) (l) ru(l) yu,m ≤ Rm

∀m ∈ M ∀l ∈ L

359

(1-2)

u∈U

(l) (l) vu,m yu,m ≤ Bu

∀u ∈ U

(1-3)

∀m ∈ M

(1-4)

m∈M l∈L

(l) ru(l) yu,m ≤ Wm

l∈L u∈U

(l) yu,m ∈ {0, 1}

∀u ∈ U ∀m ∈ M ∀l ∈ L

(1-5)

The constraint (1-1) means that a user’s task can only be performed on one device. Constraint (1-2) ensures that the resources of the devices that can be provided is limited in each round, so it is forbidden to exceed the number of resources the device oﬀered. The constraint (1-3) is to make sure that the user’s bid can’t exceed its budget in the whole auction period of T . The constraint (1-4) indicates that the amount of resources provided by sellers is limited in the whole auction period of T . Theorem 1. Social welfare optimization problem (OPT-1) is NP-hard. Proof. The multiple-choice multi-dimensional knapsack problem is a NP-hard problem [29]. In OPT-1, the amount of resources that each seller can provide is equivalent to the capacity of the backpack in each round of the auction. The resource requirement of each user is equivalent to the weight of the object. Each user can be allocated to resources or not. So OPT-1 is a special case of the multiple-choice multi-dimensional 0-1 knapsack problem, which is NP-hard. We ignore the indicator variable constraint (1-5) temporarily, and introduce dual variable vectors α, β, η and χ. We then obtain the dual problem of OPT-1: (l) (l) (l) OPT-2 obj : min Bu αu(l) + βu(l) + Rm ηm + Wm χm u∈U

u∈U l∈L

m∈M l∈L

m∈M

(2) subject to: (l) αu + βu(l) + vu,m

(l) (l) (l) ru(l) ηm + ru(l) χm ≥ vu,m ∀u ∈ U ∀m ∈ M ∀l ∈ L

(2-1)

m∈M

(l) (l) , χm ∈ [0, 1] αu(l) , βu(l) , ηm

∀u ∈ U ∀m ∈ M ∀l ∈ L

(2-2)

360

Q. Yu et al.

Since we do not know all the information in each round auction during the whole auction period of T , i.e. the demand for user resources and the corresponding bids, as well as the amount of resources provided by sellers. The auction mechanism is carried out round after round with time. So we just consider the current bids and resources in each round auction. To prevent users from running out of budget too early. We adjust the user’s bid according to its remaining budget in each round. So we introducean auxiliary (l) (l) (l) (l) (l −1) variable αu for each user u ∈ U, where αu ∈ [0, 1]. Let vu,m =vu,m 1 −αu denote the real valuation. Now, we give the following formulation. (l) (l) OPT-3 obj : max vu,m yu,m (3) u∈U m∈M

subject to:

(l) yu,m ≤ 1 ∀u ∈ U

(3-1)

m∈M

(l) (l) ru(l) yu,m ≤ Rm

∀m ∈ M

(3-2)

u∈U

(l) ∈ {0, 1} yu,m

∀u ∈ U ∀m ∈ M

(3-3)

We ignore the indicator variable constraint (3-3) temporarily, and adopt the same dual variables as in the dual of (1). We then obtain the dual problem of OPT-3: (l) (l) βu(l) + Rm ηm (4) OPT-4 obj : min u∈U

m∈M

subject to: βu(l) +

(l) (l) ru(l) ηm ≥ wu,m

∀u ∈ U ∀m ∈ M

(4-1)

m∈M

(l) ∈ [0, 1] βu(l) , ηm

3

∀u ∈ U ∀m ∈ M

(4-2)

3 Single Round Resources Auction Design in MDC

In this section, we focus on the design of the Single Round Mobile Resources Auction (SRMRA), and we prove that SRMRA is truthful and individually rational. The detailed description of SRMRA in round l is shown in Algorithm 1. The resource demands of the users and the information on resources shared by the sellers are collected by the auctioneer. The U users are the bidders in the auction; each submits a bid containing M valuations of the sellers in round l. We consider that users or sellers may join and leave during the auction period; in this case, the corresponding entries in the bids default to 0. We use Q to denote the set of winners. We choose the user u with the largest bid density \tilde{v}_{u,m}^{(l)} / r_u^{(l)}, i.e., the algorithm picks users according to their bids and the amounts of requested resources, and always selects as winner the user with the highest bid on the fewest resources. However, the resources provided by a seller are limited: the allocation cannot exceed the amount of resources shared by the seller in round l (line 13), and the amount of resources provided by each seller over the whole auction period is also limited (lines 3 and 14). We then use the VCG pricing mechanism. Let p_u^{(l)} denote the final payment of the user in round l. Let S_{-u}^{(l)} and S_u^{(l)} denote the social welfare achieved when winner u is excluded, and the social welfare achieved by the other bidders when u takes part in the bidding in round l, respectively. The payment of winner u is p_u^{(l)} = S_{-u}^{(l)} - S_u^{(l)}.

Algorithm 1. (SRMRA): Single Round Mobile Resources Auction
1: for m = 1, 2, 3, ..., M do
2:     Collect the amount of shared resources from seller m.
3:     if W_m < R_m^{(l)} then
4:         R_m^{(l)} = W_m;
5:     end if
6: end for
7: for u = 1, 2, 3, ..., U do
8:     Collect bid V_u^{(l)} = {v_{u,1}^{(l)}, v_{u,2}^{(l)}, ..., v_{u,M}^{(l)}} and the requested resource quantity r_u^{(l)} from user u.
9: end for
10: Q = ∅;
11: for all u ∉ Q do
12:     {u, m} = arg max_{u ∈ \mathcal{U}, m ∈ \mathcal{M}} \tilde{v}_{u,m}^{(l)} / r_u^{(l)};
13:     if r_u^{(l)} ≤ R_m^{(l)} then
14:         Q = Q ∪ {u}; R_m^{(l)} = R_m^{(l)} − r_u^{(l)}; W_m = W_m − r_u^{(l)}; (update W_m)
15:     end if
16: end for
17: for all u ∈ Q do
18:     Execute lines 10 to 16 again with user u excluded.
19:     p_u^{(l)} = S_{-u}^{(l)} − S_u^{(l)};
20: end for

Theorem 2. SRMRA is a truthful auction mechanism.

Proof. If the allocation algorithm is monotone and exact, and the payment scheme charges each winner its critical value, then the mechanism is truthful [30]. From line 12 of SRMRA, it is clear that a user can increase its chance of winning by increasing its bid. Therefore, the winner determination algorithm


of SRMRA is monotone. Moreover, a winning bidder u pays the minimum amount it would have to bid to obtain the resources, i.e., its critical value. This value is found by identifying the losing bidder who would win if u did not participate in the auction: user u's bid density must be at least equal to the bid density of that losing bidder for u to win its resources. Therefore, user u's critical value is exactly the payment calculated by SRMRA. Thus, we conclude that SRMRA is a truthful mechanism.

Theorem 3. SRMRA is individually rational.

Proof. p_u^{(l)} is the critical value for winner u by the analysis in Theorem 2. Thus p_u^{(l)} ≤ \tilde{v}_{u,m}^{(l)}. Since v_{u,m}^{(l)} ≥ \tilde{v}_{u,m}^{(l)}, we conclude that p_u^{(l)} ≤ v_{u,m}^{(l)}. So SRMRA is individually rational.
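To illustrate the greedy winner determination and critical-value payment idea behind SRMRA, here is a minimal Python sketch, simplified to a single seller; it is not the paper's implementation, and all names are hypothetical. Winners are chosen in decreasing order of bid density, and each winner pays the welfare the others lose because it participates, which keeps the payment at most the winner's own bid (individual rationality).

```python
def srmra(bids, demands, capacity):
    """One round of greedy winner determination with VCG-style payments.

    bids[u]:    user u's bid for the single seller considered here
    demands[u]: resources user u requests
    capacity:   resources the seller can share this round
    """
    def allocate(excluded):
        # Rank remaining users by bid density bid/demand, highest first.
        order = sorted(
            (u for u in range(len(bids)) if u not in excluded),
            key=lambda u: bids[u] / demands[u],
            reverse=True,
        )
        left, winners = capacity, []
        for u in order:
            if demands[u] <= left:      # line-13-style capacity check
                winners.append(u)
                left -= demands[u]
        return winners

    winners = allocate(excluded=set())
    payments = {}
    for u in winners:
        # Welfare of the others when u is absent vs. when u takes part.
        s_without_u = sum(bids[v] for v in allocate({u}))
        s_others = sum(bids[v] for v in winners if v != u)
        payments[u] = s_without_u - s_others
    return winners, payments

winners, payments = srmra(bids=[10, 9, 2], demands=[2, 3, 2], capacity=5)
```

In this toy run users 0 and 1 win and each pays 2, well below its bid, consistent with Theorem 3.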

4 MDC Auction Framework Design for Long-Term

In this section, we propose an MDC Auction Framework for Long-term (MAFL) that runs the Single Round Mobile Resources Auction (SRMRA) in each round. Then, we give a theoretical analysis of the approximation ratio of MAFL. Although existing single-round auctions can also be run repeatedly, users then have to trade with higher bids to get more resources in the cascading rounds of auctions. On the one hand, some users keep bidding a high price, preventing other users from getting resources. On the other hand, if a large portion of buyers run out of their budgets rapidly in the short term, the number of users participating in the competition may shrink, so the total revenue may drop significantly. Our main idea is to design an appropriate long-term auction framework in which the budget constraint is handled elaborately. In MAFL (Algorithm 2), we introduce an auxiliary variable \alpha_u^{(l)} \in [0, 1] for each user u \in \mathcal{U}. Its initial value is 0, and it increases as the remaining budget of the user decreases. Then, in each round l, we use \tilde{v}_{u,m}^{(l)} = v_{u,m}^{(l)} (1 - \alpha_u^{(l-1)}) as the virtual valuation for user u. After executing SRMRA, let Q be the set of winning users and adjust \alpha_u^{(l)} for each user u \in Q. The detailed process is displayed in Algorithm 2.

Theorem 4. When the algorithm MAFL terminates, constraint (1-1) of formulation (1) is satisfied, and each user exceeds its budget by at most a factor of 1 + \varphi, i.e., \sum_{m \in \mathcal{M}} \sum_{l \in \mathcal{L}} v_{u,m}^{(l)} y_{u,m}^{(l)} \le B_u (1 + \varphi), \forall u \in \mathcal{U}, where \varphi = \max_{u \in \mathcal{U}, m \in \mathcal{M}, l \in \mathcal{L}} v_{u,m}^{(l)} / B_u.

Proof. \alpha_u^{(l)} is the auxiliary variable we introduced. When user u gets the requested resources, \alpha_u^{(l)} increases; \alpha_u^{(l)} is the fraction of the budget that has been used by the end of round l in MAFL. Therefore, when user u runs


Algorithm 2. (MAFL): MDC Auction Framework for Long-term
1: \alpha_u^{(0)} = 0, \forall u \in \mathcal{U};
2: for l = 1, 2, 3, ..., L do
3:     \tilde{v}_{u,m}^{(l)} = v_{u,m}^{(l)} (1 − \alpha_u^{(l-1)});
4:     Execute SRMRA; let Q be the set of winning users.
5:     for all u ∈ Q do
6:         if \alpha_u^{(l-1)} + \tilde{v}_{u,m}^{(l)} / B_u < 1 then
7:             \alpha_u^{(l)} = \alpha_u^{(l-1)} + \tilde{v}_{u,m}^{(l)} / B_u;
8:         else
9:             \alpha_u^{(l)} = 1;
10:        end if
11:    end for
12:    for all u ∉ Q do
13:        \alpha_u^{(l)} = \alpha_u^{(l-1)};
14:    end for
15: end for
16: \alpha_u = \alpha_u^{(L)}, \forall u \in \mathcal{U};

out its budget (\alpha_u^{(l)} = 1), it will not get any more resources in the following rounds. Assume \alpha_u^{(l)} = 1 when l = l^*; then we get:

\sum_{m \in \mathcal{M}} \sum_{l \in \mathcal{L}} v_{u,m}^{(l)} y_{u,m}^{(l)} \le \sum_{m \in \mathcal{M}} \sum_{1 \le l \le l^*} v_{u,m}^{(l)} y_{u,m}^{(l)} = \sum_{m \in \mathcal{M}} \sum_{1 \le l < l^*} v_{u,m}^{(l)} y_{u,m}^{(l)} + …
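The long-term loop of MAFL can be sketched in Python as follows. This is an illustrative simplification, not the paper's code: the single-round auction is abstracted as a callable, and the demo `top1_auction` below is a hypothetical stand-in for SRMRA that awards the highest virtual bid at a first-price payment. The point illustrated is the budget-tracking variable alpha: as a user spends, its virtual bids shrink, so a heavy spender is priced down instead of exhausting its budget in the first few rounds.

```python
def mafl(rounds, budgets, run_single_round_auction):
    """Long-term wrapper over a single-round auction (MAFL sketch).

    rounds:  list of per-round true bids, bids[u] = (value, demand)
    budgets: dict {u: total budget over all rounds}
    run_single_round_auction: callable taking {u: (virtual_value, demand)}
        and returning (winners, payments).
    """
    alpha = {u: 0.0 for u in budgets}   # spent-budget fraction per user
    total_revenue = 0.0
    for bids in rounds:
        # Virtual bid value * (1 - alpha) shrinks as the budget drains.
        virtual = {
            u: (value * (1.0 - alpha[u]), demand)
            for u, (value, demand) in bids.items()
            if alpha[u] < 1.0           # exhausted users are excluded
        }
        winners, payments = run_single_round_auction(virtual)
        for u in winners:
            spent_fraction = virtual[u][0] / budgets[u]
            alpha[u] = min(1.0, alpha[u] + spent_fraction)
            total_revenue += payments[u]
    return total_revenue, alpha

def top1_auction(virtual):
    """Hypothetical stand-in for SRMRA: highest virtual bid wins, pays its bid."""
    if not virtual:
        return [], {}
    u = max(virtual, key=lambda x: virtual[x][0])
    return [u], {u: virtual[u][0]}

rounds = [{0: (8, 1), 1: (5, 1)}, {0: (8, 1), 1: (5, 1)}]
revenue, alpha = mafl(rounds, budgets={0: 10, 1: 10},
                      run_single_round_auction=top1_auction)
```

In the demo, user 0 wins round 1, but its deflated virtual bid lets user 1 win round 2, so both users' budgets survive, illustrating the intent of the penalty step.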

y_{kt} = B, if t = 1;  y_{kt} = y_{k(t-1)} − \sum_j f_{k(t-1)j}, if 2 ≤ t ≤ 5    (8)

where K represents the total number of MCVs, and B and M are the battery capacity and the number of charging guns of each MCV, respectively. We assume that one MCV can meet the charging demand of 10 EVs and that it has 3 charging guns. We further assume that f_{ktj} is the amount of demand that can be satisfied this time by MCV k dispatched from its current position i to the target parking lot j; if f_{ktj} = 0, the MCV will not be scheduled. y_{kt} is the remaining power of MCV k at the start of time period t; by default, one MCV can charge no more than 3 EVs at the same time. E′_{tj} represents the current demand at the grid point

Towards an Eﬃcient and Real-time Scheduling Platform for MCVs

411

j in the time period t, and the initial demand at j is E_{tj}. As shown in Eq. (9), after an MCV k is scheduled to this point during this time period, the demand satisfied is f_{ktj}, and the distribution of UD (unmet demand) in the current time period will be reset:

E′_{tj} = E_{tj} − f_{ktj}    (9)

GSD and Scheduling Distance (GSDD): This strategy considers both the UD in each parking lot and the scheduling distance of each dispatch, because the movement of MCVs also consumes part of their power. For this strategy, the scheduling result for t = 1 is the same as that of GSD, which satisfies Eq. (8). Starting from the second time period, the objective function is as follows:

max: \sum_{k=1}^{K} \sum_{j=1}^{n} (f_{ktj} − μ_{kt} × rate)

s.t.: f_{ktj} = min[M, y_{kt}, E′_{tj}] > 0
      2 ≤ t ≤ 5
      0 < rate < 1
      μ_{kt} ≥ 0    (10)

where μ_{kt} represents the distance traveled by MCV k in time period t, and rate is the influence of the distance on each dispatch. When rate = 0, this strategy is equivalent to the previous scheduling strategy.

Global Optimization Strategy Based on Demand and Scheduling Distance (GOSDD): Unlike the second scheduling strategy, this strategy considers the possible scheduling situations over all time periods and then finds the optimal solution. We define the penalty cost as the sum of the weighted dispatch distance and the UD of all the dispatches in each time period. The specific objective function is as follows:

min: \sum_{j=1}^{n} δ_{tj} + \sum_{k=1}^{K} μ_{kt} × rate

s.t.: \sum_{k=1}^{K} f_{ktj} + δ_{tj} = E_{tj}
      δ_{tj} ≥ 0, 1 ≤ t ≤ 5
      0 ≤ f_{ktj} ≤ y_{kt} and f_{ktj} ≤ M × x_{ktj}
      x_{ktj} ∈ {0, 1} and \sum_{j=1}^{n} x_{ktj} = 1    (11)

where \sum_{k=1}^{K} f_{ktj} and δ_{tj} represent the total amount of demand that can and cannot be satisfied, respectively, by all the MCVs scheduled to grid point j during time period t. When x_{ktj} = 1, MCV k is scheduled to parking lot j during time period t.
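As an illustration of the per-period greedy dispatch behind GSDD, here is a minimal Python sketch (hypothetical names, not the platform's code). For each MCV it scores every parking lot by f_ktj − μ_kt × rate, with f_ktj = min(M, y_kt, E′_tj) as in Eq. (10), dispatches the MCV to the best-scoring lot, and then resets the remaining demand as in Eq. (9).

```python
def gsdd_round(mcvs, demand, rate, guns=3):
    """One time period of a GSDD-style greedy dispatch (sketch).

    mcvs:   {k: (battery_left, position)}
    demand: {j: (unmet_evs, position)}
    rate:   weight of travel distance against satisfied demand, as in Eq. (10)
    guns:   charging guns per MCV (M in the paper, 3 by default)
    """
    def dist(a, b):
        # Grid (Manhattan) distance; a stand-in for real road distance.
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    schedule = {}
    for k, (battery, pos) in mcvs.items():
        best_j, best_score, best_f = None, 0.0, 0
        for j, (unmet, lot_pos) in demand.items():
            f = min(guns, battery, unmet)          # f_ktj = min[M, y_kt, E'_tj]
            score = f - dist(pos, lot_pos) * rate  # f_ktj - mu_kt * rate
            if f > 0 and score > best_score:
                best_j, best_score, best_f = j, score, f
        if best_j is not None:
            schedule[k] = (best_j, best_f)
            unmet, lot_pos = demand[best_j]
            demand[best_j] = (unmet - best_f, lot_pos)  # demand reset, Eq. (9)
            mcvs[k] = (battery - best_f, lot_pos)       # MCV moves and discharges
    return schedule

mcvs = {0: (10, (0, 0))}
demand = {'a': (5, (1, 0)), 'b': (2, (0, 1))}
schedule = gsdd_round(mcvs, demand, rate=0.1)
```

GOSDD would instead search over all periods jointly; this greedy per-period step is what keeps GSDD cheap enough for real-time use.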

412

Q. Liu et al.

5 Evaluation

Figure 11 shows the impact of different distance weights on the total number of services, where the ordinate is the total number of services in a day. We can see that the curve for rate = 0.05 is similar to the one for rate = 0, because the radius of the Sixth Ring Road is only about 25 km. In the case of rate = 0.1, the total number of services in a day reaches its upper limit when K = 150. Figure 12 shows the distribution of service times over the time periods for different numbers of MCVs when rate = 0.1. It can be seen that, affected by demand, for all possible K the number of services is lowest at t = 1 and reaches its maximum at t = 3. For K = 50, the amount of service is lowest at t = 5, because most MCVs have reached their maximum number of services before the 5th time period. In addition, the distribution for K = 150 is similar to that for K = 200, which also shows that for rate = 0.1, 150 is already the upper limit of K; beyond it, extra MCVs bring no additional improvement in the number of services. Figure 13 shows the utilization of different numbers of MCVs and the number of EVs they served in one day when rate = 0.1. In this figure, the blue and red curves represent the utilization rate of the MCVs and the number of services, respectively. From the blue curve, we can see that when the number of MCVs is less than 75, the utilization of the MCVs exceeds 95%. This is because the amount of unmet demand far exceeds the services the MCVs can provide, so before the end of the day most MCVs have already run out of electricity. The number of unserved

Fig. 11. Scheduling distance weight

Fig. 12. MCVs number analysis

Fig. 13. Usage rate and service times

Fig. 14. Number of unmet EVs


Fig. 15. Analysis of scheduling results (rate = 0.1, K = 25)

Fig. 16. Analysis of scheduling results (rate = 0.1, K = 50)

Fig. 17. Analysis of scheduling results (rate = 0.1, K = 75)

EVs in each time period before and after deploying MCVs when rate = 0.1 is shown in Fig. 14. The blue and orange histograms indicate the number of unserved EVs in each time period for K = 0 and K = 150, respectively. We compare the scheduling results of the three different scheduling strategies in Figs. 15, 16 and 17. Figure 15a shows the comparison of the number of services in each time period. It can be seen that before the start of the 5th time period, the demand satisfied by GSD is slightly higher than that of the other two strategies. At t = 5, the number of services under each of the three strategies drops sharply, because most MCVs run out of power before the start of the 5th time period. Comparison between Figs. 16a and 17a shows that as K increases, the number of services in each time period also increases, and the change is most significant at t = 5.


Figure 15b shows the scheduling distance of the MCVs from one time period to the next. It can be seen that the scheduling distance of GSD is significantly greater than those of the other two strategies. The total scheduling distance from t = 4 to t = 5 is about 0, which also shows that most MCVs have used up their service power by the end of t = 4. Similarly, comparing Figs. 16b and 17b, we can conclude that as K increases, the dispatch distance in each time period also increases; GSD has the largest increase, while the changes in GSDD and GOSDD are small. Figure 15c shows the comparison of the total penalty costs of the different scheduling strategies in each time period, that is, the weighted sum of the unmet demand and the scheduling distance. Comparing Figs. 16c and 17c, we can see that as K increases, the difference between GSD and the other two strategies also increases. In addition, for the penalty cost, no matter how large K is, the total penalty in each time period is about the same: when K is relatively small, the dispatching distance is small, and consequently the demand that can be satisfied is also limited; when K is larger, the total dispatching distance increases, but so does the number of services in each time period. Of course, as K increases, the average penalty of each MCV becomes smaller and smaller. Through the comparison of Figs. 15, 16 and 17, we find that the effect of GSD is clearly worse than the other two scheduling strategies when K is moderate. The overall scheduling effects of GSDD and GOSDD are similar, but the computational complexity of GOSDD is significantly larger because it needs to find the optimal solution among all possible schedules. As K increases, the time GOSDD takes grows exponentially, which is not suitable for real-time dispatch. The complexity of GSDD, in contrast, is much lower, which makes it more suitable for real-time application.

6 Conclusion

MCVs have good mobility and scalability. They can be used not only to alleviate the pressure on charging stations and reduce the waiting time of users, but also to provide guidance for the construction of charging stations in the future. For example, a place to which MCVs are often dispatched may be more suitable for building a new charging station, and the size of the charging station can be roughly estimated according to the number of services the MCVs provide there. From the comparison of the three different scheduling strategies, the scheduling result of GSDD is similar to that of GOSDD; considering the computational complexity of the algorithms, GSDD is more applicable to actual scheduling. In future work, we will consider unexpected events, such as the breakdown of EVs caused by battery depletion, in order to achieve multi-functional scheduling. On the other hand, we will also consider the impact of different driving paths (especially with traffic jams) on the scheduling of MCVs. Acknowledgment. This work was partially funded by NSFC-61472384. We are particularly grateful for the cooperation and support from echarge.



SoProtector: Securing Native C/C++ Libraries for Mobile Applications

Ning Zhang1, Guangquan Xu1(&), Guozhu Meng2, and Xi Zheng3

1 Tianjin Key Laboratory of Advanced Networking (TANK), School of Computer Science and Technology, Tianjin University, Tianjin 300350, China [email protected]
2 Nanyang Technological University, Singapore, Singapore
3 Department of Computing, Macquarie University, Sydney, Australia

Abstract. Java code is easily decompiled, and third-party SO files are frequently used by developers to improve development efficiency. As a result, more and more core functions of Android applications are implemented in the native layer. However, there is neither comprehensive security research nor an automated security analysis tool for the Android native layer, especially for third-party SO files that are dynamically loaded by applications. To solve this problem, we propose SoProtector, a novel and effective system to defend against privacy leaks, which mainly analyzes the data streams between two levels: the application layer and the native layer. In addition, SoProtector includes a real-time monitor to detect malicious functions in binary code. Our evaluation on 3,400 applications demonstrates that SoProtector can detect more sources, sinks, and taint propagation paths than most static analysis tools, and that it detects and effectively blocks more than 82% of applications that dynamically load malicious third-party SO files, with low performance overhead.

Keywords: Mobile security · Android · Mobile privacy · Native C/C++ libraries

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 417–431, 2018. https://doi.org/10.1007/978-3-030-05057-3_32

1 Introduction

At present, privacy disclosure is still a serious problem in smartphone applications. Here are a few examples: (1) Facebook leaked the phone number from a mobile device before the user logged into the application [1]; (2) Angry Birds collected user data, which was found to be used by the NSA to profile users [2]; (3) out of 25,976 Android applications, 969 applications leaked location data and 347 recorded audio without the user's permission [3]. Along with privacy concerns there are security concerns as well. Malware constitutes the main medium for security attacks against mobile devices. It has recently been reported [4] that almost 60 percent of existing malware sends stealthy premium-rate SMS messages. Google Play, the official market for Android apps, has also hosted applications that were later found to be malicious [5]. In the past few years, malware has increasingly relied on root exploits. Some of the famous malware families include DroidKungfu [6], GingerMaster [7] and DroidDream [8]. These exploits allow for the


escalation of privileges, which bypasses the security measures of the Android operating system and gives the malware unlimited access to the device, allowing it to download and run a payload that harvests user information. Most importantly, a new family of malware (Godless [9]), whose root exploit is stored in a native library, emerged in 2016. The exploit binary contains a series of vulnerabilities, including the Towelroot exploit (CVE-2014-3153) and the PingPong exploit (CVE-2015-3636). This alarming trend of malware using native library code is an important motivation for us to create a detection system that identifies malware containing such exploits [9].

For the detection of privacy leaks (although some detection frameworks exist [19–22]), the most important method is taint analysis, both static and dynamic. The main dynamic taint analysis tools are TaintDroid [10] and AppFence [11]; typical static taint analysis tools include FlowDroid [12] and AndroidLeaks [3]. However, static taint analysis tools cannot effectively handle Android's dynamic loading and reflection mechanisms, while dynamic taint analysis tools cannot generate data-flow graphs of the C/C++ programs (which produce the SO files) on the native layer. Unfortunately, reference [13] points out that from 2010 to 2014 the proportion of malicious Android applications using dynamic loading and reflection increased from 43.87% to 78%, and of non-malicious applications from 55% to 93%. The large number of applications using dynamic loading makes it increasingly difficult for current taint analysis tools to effectively detect privacy leaks in Android applications [14].

From a security standpoint, SO files (see Fig. 1) are binary code files. Reference [15] measured that 1,161 insecure code snippets posted on Stack Overflow were copied and pasted into 1,305,820 Android applications available on Google Play, demonstrating that insecure code snippets (both Java and C/C++) proliferate within the Android ecosystem and are reused at a high rate. Inspired by this situation, given two binary functions, such as a known malicious function and an unknown function, we would like to detect whether they are similar [16]; this problem is known as "binary code similarity detection." References [17, 18] used a graph matching algorithm that checks whether two functions' control flow graph representations are similar. Genius [19] learns high-level feature representations from control flow graphs and encodes (i.e., embeds) the graphs into high-dimensional numerical vectors. However, graph matching algorithms are slow, requiring super-linear runtime in the graph size, so such approaches are inevitably inefficient. In recent years, deep learning [20] has been applied to many application domains, including binary analysis [21], and has shown stronger results than other approaches. Reference [22] proposed a deep neural network-based approach to generate embeddings of binary functions for similarity detection. References [27, 28] presented a method to extract important byte sequences in malware samples by applying a convolutional neural network (CNN) to images converted from binary data. However, none of the related work listed above targets the ARM platform.


Fig. 1. The SO libraries in Android APK

Contributions. In summary, this paper makes the following contributions.
• Based on FlowDroid, we developed an effective taint analysis tool for the data interaction between native C/C++ libraries and the Java API framework by changing the Android source code, a problem not solved by traditional static or dynamic taint analysis methods. We verified its effectiveness through experiments.
• For reversible SO files, we designed an automated tool that analyzes the combined characteristics of assembly code to detect whether they are malicious; we tested the tool's performance and verified its effectiveness through experiments. For non-reversible SO files, we developed a new method that constructs texture maps, combining image processing with machine learning to detect malicious variants.
• For third-party SO files called through the dynamic loading mechanism, we are the first to propose a real-time monitoring platform that uploads the SO files for online examination and monitors changes to third-party SO files. By changing the Android source code and combining it with dynamic taint analysis tools, the monitoring platform monitors, in real time, the third-party SO files loaded by the apps under test. The validity of the method is verified through experiments.
• We have created a malicious native program dataset, including Android source programs and malicious binary SO files.

Roadmap. The rest of this paper is organized as follows. Section 2 gives a motivating example and introduces background knowledge about Android dynamic loading. Section 3 gives an overview of SoProtector and illustrates its key techniques. Section 4 describes the approach step by step. Section 5 describes the experiments and gives the evaluation. Section 6 discusses the limitations of SoProtector and concludes this work.

2 Problem Statement

In this section, we investigate the challenges of statically analyzing SO files. We also give some background on important mechanisms of the Android platform.

2.1 Background

Dynamic loading means that an Android application achieves some specific functions by loading executable files that do not exist locally and can be replaced at run time; the Android NDK uses dynamic loading, for example loading SO libraries and calling their functions through JNI methods. SO libraries are generally compiled from C/C++ and run in the native layer. Because they are much more efficient than code running in the virtual machine layer, SO libraries are often chosen instead of native Java code for performance-critical work (such as T9 search or Bitmap decoding). In addition, since an SO library is compiled from C/C++ and decompiles only into assembly code (which is often hard to understand), it can also be used for other purposes. For instance, a new family of malware (Godless) uses a root exploit whose code is stored in a native library [9]. In general, SO libraries are packaged inside the app, but they can also be loaded from external storage files.

2.2 Challenges

SO libraries are binary files composed of 0s and 1s: if protections such as packing are applied, we may be unable to recover their assembly code. There is also no automated tool, so the data interaction between the native and Java layers must be analyzed manually, which is inefficient. In addition, dynamically loading third-party SO libraries is becoming more and more popular in application development; third-party SO libraries do not need to be packaged directly in the APK (see Fig. 2).

Fig. 2. Third-party SO libraries can be updated from the Internet

While the program is running, the required SO files are loaded into a designated executable private directory. Because the SO files in use are not inside the application package, static analysis cannot effectively analyze data flows that involve the native layer. More importantly, since SO files can be updated at any time after the program runs, without requiring the user to reinstall the APK, a malicious APK that replaces a benign SO file with a malicious one after passing the installation security check will not be monitored or handled by security protection software.


3 System Overview

In this section, we give an overview of the SoProtector framework, which consists of SoDetection and SoPlatform, and describe the key techniques applied in it. Figure 3 shows the overall architecture of SoProtector. To facilitate the following description, we define the following terms: (1) Source method: the native-layer method invoked from the Java layer, denoted by Sf, as shown in Listing 1 (e.g. the JNITransmit method). (2) Source file: the C/C++ file where the source method lies, denoted by Sw. (3) Target method: the Java-layer method called from the native layer, denoted by Tf, as shown in Listing 3. (4) Target class: the class where the target method lies, denoted by Tc. (5) JNI interaction method: the method invoked by the caller to implement the reflection mechanism, denoted by Jh (e.g. the GetMethodID and GetStaticMethodID methods). SoProtector consists of SoDetection and SoPlatform; SoDetection mainly consists of two parts, a dynamic execution module and a static analysis module, which for convenience we abbreviate as SoDetection-x and SoDetection-y respectively.

Fig. 3. System overview of the SoProtector


Information Extraction. SoDetection first runs SoDetection-x on the computer, and the tested application is installed onto the Android device. The Android system on the device is generated by modifying and recompiling the Android source code, so that relevant events, such as dynamic loading functions, are recorded in the log output; we illustrate the changes made to the Android system source code in the next section. After installing the application, SoDetection-x runs it and cyclically reads the system's log output to obtain information about the application's dynamic loading and reflection invocations. When capturing dynamic loading behavior, SoDetection-x sends a download command over adb to the Android device to download the SO files and .dex files to the local computer. When capturing the application's reflection calling behavior, SoDetection-x extracts from the log the source method Sf, the JNI interaction method Jh, and the target method Tf corresponding to the reflection call, and stores this information as a triplet in a file on the local computer, which we call the SJT repository. The .dex files and the SJT repository are used for the subsequent static analysis, and the downloaded SO files are used in SoPlatform's work. Data Analysis. When the dynamic analysis module has finished, SoDetection runs SoDetection-y, which is in fact an improvement of FlowDroid. We added the .dex files and the SJT repository into SoDetection-y's static taint analysis process. SoDetection-y first loads the required Java class files of the APK and the .dex files into memory, then translates them into Jimple, the three-address intermediate language of Soot [20]. According to the SJT repository, SoDetection-y translates between reflection methods and source methods so that it can construct the correct function call graph.
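The triplet extraction performed during Information Extraction can be sketched as follows. The log tag and line layout below are assumptions for illustration; the exact format emitted by the modified Android sources is not given in the paper.

```python
import re

# Hypothetical log format emitted by the hooked GetMethodID/invoke methods;
# the real tag and layout depend on the modified Android sources.
LOG_PATTERN = re.compile(
    r"SoDetection:\s*Sf=(?P<sf>\S+)\s+Jh=(?P<jh>\S+)\s+Tf=(?P<tf>\S+)"
)

def extract_sjt_triplets(log_lines):
    """Scan logcat output and collect (Sf, Jh, Tf) triplets for the SJT repository."""
    repository = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            repository.append((match.group("sf"), match.group("jh"), match.group("tf")))
    return repository

# Invented example log lines
logs = [
    "D/SoDetection: Sf=Java_com_app_JNITransmit Jh=GetMethodID Tf=sendFaker1",
    "D/Other: unrelated output",
]
print(extract_sjt_triplets(logs))
```

In this sketch the SJT repository is an in-memory list of tuples; the paper stores it as a file on the local computer for SoDetection-y to consume.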
Malicious Detection by SoPlatform. SoPlatform first calculates the hash value of the SO file and stores it. The malicious code image, the OpCode n-gram, and the system calls are used as features. We used a DNN classifier, a decision tree, and a random forest as the machine learning algorithms for classification, judging whether an SO file is malicious by a threshold (determined on a specific large set of test samples); if it is malicious, we also determine which malicious native family it belongs to. Training is accelerated with the tools xgboost and PyPy. Note: to improve efficiency, if the app's SO file is replaced with a malicious one during an update, its hash value changes and the file is re-analyzed; if the hash value is unchanged, no analysis is performed.
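The hash-based skip logic can be sketched as below. The paper does not specify which hash function SoPlatform uses; SHA-256 is assumed here for illustration.

```python
import hashlib

class SoPlatformCache:
    """Skip re-analysis of an uploaded SO file unless its content hash changed."""

    def __init__(self):
        self._seen = {}  # file name -> last analysed SHA-256 digest

    def needs_analysis(self, name: str, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if self._seen.get(name) == digest:
            return False           # unchanged since the last upload: skip
        self._seen[name] = digest  # new or replaced SO file: (re)analyse
        return True

cache = SoPlatformCache()
print(cache.needs_analysis("libfoo.so", b"benign"))     # first upload
print(cache.needs_analysis("libfoo.so", b"benign"))     # unchanged
print(cache.needs_analysis("libfoo.so", b"malicious"))  # replaced after install
```

The third call models exactly the attack from Sect. 2.2: a benign SO file swapped for a malicious one after installation changes the digest and triggers re-analysis.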

4 Implementation

In this section, we explain the details of the four stages of SoProtector's implementation approach.

4.1 Stage 1: Pre-processing

We modified the Android source code (version 4.1.2). The main changes include: (a) We modified the source so that it stores method parameters, to obtain complete information in the method stack. (b) We hooked the Runtime class to record mLibPaths, so that when an external SO file is loaded, its name and address are recorded to facilitate SoProtector's next steps. (c) We hooked the GetMethodID and GetStaticMethodID methods. (d) We hooked some invoke methods. (e) We hooked pNewEntry to obtain process information.

4.2 Stage 2: Disposal by SoDetection-X

SoDetection-x reads the phone's log information and extracts, by uid, the log records of the tested app. When reading a record of a dynamically loaded dex file, the dex file is downloaded to a designated folder on the local computer according to the file information recorded in the log, to provide input for SoDetection-y. When capturing the output of GetMethodID or GetStaticMethodID, or the reflection calling information output by some invoke methods, the corresponding source method Sf, JNI interaction method Jh, and target method Tf are extracted and stored as a triplet in the SJT repository, to provide reflection calling information for SoDetection-y's taint analysis process. Figure 4 describes the extraction principle for this information. Because the GetMethodID and GetStaticMethodID methods are called by upper-layer functions, there are two ways to capture and display the function calls:

Fig. 4. The way to get the library


For SO files that can be reversed to ARM assembly and C code, we use four tools: GCC, addr2line, and the open-source tools Pvtrace and Dot (see Fig. 5). We instrument the native functions where Tf lies using GCC's instrumentation functions, which generate a trace file named trace.txt. After using addr2line to translate the function addresses into function names, we obtain the function call graph with Dot; this graph is transformed into a native-layer data-flow diagram that is added to the FlowDroid diagram. By analyzing the calling relationships, we can find the Sf and Jh that match Tf.
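The trace-to-call-graph step can be sketched as below. The pair format and the address-to-name table stand in for the real pvtrace output and addr2line lookups, so treat them as illustrative assumptions.

```python
def build_call_graph(trace_pairs, addr_to_name):
    """Turn (caller_addr, callee_addr) pairs from a pvtrace-style trace into a
    name-level call graph (adjacency sets). Unresolved addresses keep their hex form."""
    graph = {}
    for caller, callee in trace_pairs:
        src = addr_to_name.get(caller, caller)
        dst = addr_to_name.get(callee, callee)
        graph.setdefault(src, set()).add(dst)
    return graph

# Invented example: addresses as they might appear in trace.txt,
# with an addr2line-style lookup table.
trace = [("0x4013B3", "0x40134C"), ("0x40134C", "0x401200")]
names = {"0x4013B3": "JNITransmit", "0x40134C": "GetMethodID"}
print(build_call_graph(trace, names))
```

The resulting adjacency structure is what gets merged into the FlowDroid graph as the native-layer supplement.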

Fig. 5. The way to deal with the reversible SO files

For SO files that are irreversible or difficult to reverse, we trigger the native function with IDA Pro. We observe the SO file's different segments mapped into memory in the process table, obtain the code segment with execute permission, and find its memory base address so as to extract a series of DCB data. The header string of the DCB data contains Sf and the Jh that matches Sf.

4.3 Stage 3: Disposal by SoDetection-Y

SoDetection-y is an improved version of FlowDroid: it adds the .dex files and the SJT repository (see Fig. 6) to the analysis process so that it can correctly handle dynamically loaded SO files. Our analysis method is based on Soot's Jimple language. SoDetection-y first loads all the classes into memory, then generates the main method and builds the function call graph. To reduce the memory burden, it takes only the corresponding classes in the loaded .dex files. When building the function call graph, SoDetection-y automatically adds the call graph generated from the disposal of SO files to the main graph. In a word, SoDetection-y resolves the reflection methods (in the native layer and the application layer) of the source files so as to form a complete control flow graph containing the data interaction of the application layer and the native layer. The algorithm flow chart for processing the SJT repository is shown in Fig. 6.

Fig. 6. Algorithm flow chart of processing SJT library

For an SJT mapping, SoDetection-y first determines whether the method is a getxxxid or an invoke method. It then converts the target reflection parameter Sp to the target parameter Tp, and determines whether the target reflection object is empty; if it is empty, it defines and assigns the target object. For the getxxxid method, it mainly judges the initialization parameters from the source data, as shown in Fig. 6. The taint tracking between native C/C++ libraries and the Java API framework is shown in Fig. 7. Our tool can distinguish the flow of private data between the application and native layers.
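A minimal sketch of the SJT resolution idea: each (Sf, Jh, Tf) mapping rewrites the opaque JNI-interaction edge into a direct edge to the reflective target, so the taint analysis sees the real target method. The adjacency-set graph representation is an assumption for illustration.

```python
def resolve_reflection_edges(call_graph, sjt_repository):
    """For each (Sf, Jh, Tf) mapping, replace the indirect edge Sf -> Jh with a
    direct edge Sf -> Tf so taint analysis sees the real target method."""
    resolved = {caller: set(callees) for caller, callees in call_graph.items()}
    for sf, jh, tf in sjt_repository:
        edges = resolved.setdefault(sf, set())
        edges.discard(jh)  # drop the opaque JNI interaction edge
        edges.add(tf)      # add the concrete reflective target
    return resolved

# Invented example mirroring Fig. 7: JNITransmit reflectively reaches sendFaker1.
graph = {"JNITransmit": {"GetMethodID"}}
sjt = [("JNITransmit", "GetMethodID", "sendFaker1")]
print(resolve_reflection_edges(graph, sjt))
```

After resolution, a taint path such as getDeviceId -> JNITransmit -> sendFaker1 becomes visible to the FlowDroid-based analysis.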

Fig. 7. The taint track between native C/C++ libraries and Java API framework. (The figure shows private data flowing from Java code in Sw, where getDeviceId's result is passed to JNITransmit, into the native layer, where GetMethodID locates Tf (sendFaker1) in Tc and CallVoidMethod invokes it.)

4.4 Stage 4: Processing by SoPlatform

SoPlatform is deployed on a remote server to detect SO files, which are uploaded to the server. If an SO file is uploaded for the first time, the server first calculates and stores its hash value. The malicious code image, the OpCode n-gram, and the system calls are used as features. We used a DNN classifier, a decision tree, and a random forest as the machine learning algorithms for classification, judging whether an SO file is malicious by a set threshold and, if so, determining which malicious native family it belongs to. Note: to improve efficiency, if the app's SO file is replaced with a malicious one during an update, its hash value changes and the file is re-analyzed; if the hash value is unchanged, no analysis is performed. An SO file is a binary file, and the difficulty is how to obtain the behavioral characteristics of the malicious code. We can use heuristic scanning for unknown binary code detection; heuristics is a static detection method that does not actually run the binary file and has the highest efficiency (see the experimental part). Note: an SO file has different segments mapped into memory (including the data segment and the code segment; the process table shows a number of SO sub-segments), and we need to find the SO code segment with execute permission. Next, we describe the three selected features in detail:

• Feature 1: Presenting a binary file as a gray-scale image, using the texture features of the image to determine the maliciousness of the binary. For a binary file, each byte ranges over 00–FF, corresponding exactly to gray levels 0–255 (0 is black, 255 is white). The binary file is converted into a matrix (each element corresponds to a byte; the size of the matrix can be adjusted according to the actual situation), and the matrix can easily be converted into a gray-scale image. Specific implementation (in Python):


(1) We use the hexlify function to transform the binary file into a hexadecimal string; (2) after byte segmentation, we use the reshape function to create a rectangle according to the chosen width; (3) we use the fromarray function to convert this rectangle into an image. Malicious code images of the same family share a certain textural similarity, while different malicious code families differ. Using the GIST feature technique from computer vision, a five-dimensional perceptual vector is used to describe the image: an image is input and the corresponding GIST descriptor is output. After obtaining these vectors, classification training with machine learning algorithms can be done.

• Feature 2: Opcode sequence frequency of appearance. The code of the SO file is recovered by reversing against the ARM instruction set, and the opcode sequence is obtained with Python. The sequence is split into sub-sequences of length n (n = 1, 2, 3), and we compute the TF value of each opcode sub-sequence, obtaining a vector S = (D1, D2, …, Dn) of opcode sequence frequencies. We combine the two values above into a weighted vector V = (wtf1, wtf2, …, wtfn). We then compute the cosine similarity between the vector V1 of the malicious SO file under test and the vectors of m different kinds of malicious samples, up to Vm+1, respectively.

• Feature 3: Sequences of system API calls. With IDA Pro, each file is disassembled into assembly language, and a gdl file containing the assembly code is generated. Since IDA Pro disassembles binary files into basic blocks of assembly code, the gdl file captures this valuable information. The system calls (see Fig. 8) are recorded in a text output file that is fed into a machine learning algorithm. The output files are used to model the behavior of the binary or native code in both malware and benign applications. We then use a machine learning algorithm such as a random forest classifier for the classification and detection of malware.
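Feature 1's byte-to-pixel mapping can be sketched without the NumPy/PIL pipeline (hexlify/reshape/fromarray) named above, as a plain list-of-rows matrix. Zero-padding the final short row is an assumption here; the paper only says the matrix size can be adjusted.

```python
def binary_to_gray_matrix(data: bytes, width: int):
    """Map each byte (0x00-0xFF) to a gray level (0-255) and reshape the byte
    stream into rows of `width` pixels; a short final row is zero-padded."""
    pixels = list(data)
    if len(pixels) % width:
        pixels += [0] * (width - len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

matrix = binary_to_gray_matrix(b"\x00\x7f\xff\x10\x20", width=2)
print(matrix)  # [[0, 127], [255, 16], [32, 0]]
```

Each row of the matrix becomes one scanline of the gray-scale image from which the GIST descriptor is then computed.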

Fig. 8. Statistic of system call for AnserverBot Malware


5 Empirical Study

In this section, we first present our empirical settings and then our evaluation results.

5.1 Empirical Settings

Dataset: We crawled 3,000 apps from the Wandoujia Store [24] (covering its 15 pre-classified categories) and 400 apps with a native layer from VirusShare [25] (with malware spanning 2013 to 2017), Genome (apps divided into 49 malware families), Contagio Mobile [26] (we tested SoProtector against 13 malware families from the Contagio database), and reference [29]. For each category in the Wandoujia store, we downloaded the top 200 apps. Except for some connection errors that occurred during crawling, we collected 3,400 apps in total as our dataset, which is used for both model training and the evaluation of SoProtector. Note: the malicious native families include (1) ADRD, (2) AnserverBot, (3) BaseBridge, (4) Geinimi, (5) Asroot, (6) BeanBot, (7) Bgserv, (8) DroidKungFu1, (9) DroidKungFu2, (10) DroidKungFu3, and (11) DroidKungFu4.

Environment: The main hardware devices for the experiment were a Samsung S6 mobile phone (4-core processor) and an ASUS computer (8-core CPU with 8 GB of memory). We built the modified Android system into a ROM and flashed it onto the Samsung phone; all APKs tested by SoProtector ran on this Android system. The computer was used to run the main program of SoPlatform.

5.2 Overall Analysis and Performance

From Table 1 we can see that SoDetection detects more sources, sinks, and taint propagation paths than FlowDroid, because it can effectively handle the dynamic loading and reflection mechanisms. (Note: the numbers in the column "Content of privacy" denote the types of private data that can be leaked: 1-call records, 2-geolocation information, 3-message records, 4-contacts, 5-mobile phone identification, 6-Baidu accounts, 7-WiFi information, 8-Bluetooth information, 9-base station information, 10-browser information. The marks in the column "Ways of privacy leakage" denote the channels through which private data can be leaked: 1-network, 2-short message, 3-log, 4-file. The top 5 rows are benign apps and the last 5 are malware; details of these packages are publicly available on our laboratory website [30].) From Table 2 we can see that non-malicious applications may not use the dynamic loading mechanism, since they need not load third-party SO files, while all of the malicious applications whose malicious code is in SO files use dynamic loading. During our experiment, the disposal phase took 2 s per application; pre-processing time is included in the disposal phase. The static analysis over the dataset was processed in 23 concurrent threads. Since SoProtector mainly targets customized system vendors and security analysts, we consider such overhead quite acceptable.


Table 1. Effectiveness analysis

APK ID | FlowDroid                   | SoDetection                 | Content of | Ways of privacy
       | Sink no. | Taint prop. paths| Sink no. | Taint prop. paths| privacy    | leakage
-------+----------+------------------+----------+------------------+------------+----------------
   1   |    35    |        0         |    41    |        4         | {1,2}      | {1}
   2   |    24    |        3         |    30    |        7         | {3,5}      | {1,2}
   3   |    72    |       12         |    74    |       19         | {6,10}     | {1}
   4   |    44    |       21         |    44    |       26         | {1,2}      | {1,3,4}
   5   |    86    |        0         |    86    |        0         | {7,10}     | {1}
   6   |    14    |        5         |    17    |        7         | {3,6}      | {1,3}
   7   |    29    |       14         |    30    |       15         | {2,3,4,5}  | {1,2}
   8   |     0    |        0         |     1    |        1         | {1,3}      | {1,2,4}
   9   |    14    |        5         |    16    |        6         | {4,9}      | {1,2,3}
  10   |    11    |        4         |    17    |        9         | {1,7}      | {1,3,4}

Table 2. Mechanism analysis

Type               | Total | With dynamic loading mechanism | With invoke mechanism
                   |       | App number | Proportion        | App number | Proportion
-------------------+-------+------------+-------------------+------------+-----------
Non-malicious apps | 3000  |    1637    |   0.5456          |    1892    |   0.6306
Malicious apps     |  400  |     400    |   1               |     135    |   0.3375

5.3 Precision

Based on the total numbers of TPs, FPs (non-malware apps mistaken for malware), and FNs (malware apps mistaken for non-malware) for SoProtector (2972, 375, 53), and the corresponding totals for SoDetection (973, 68, 103), we compute precision and recall as follows:

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

Overall, SoProtector precisely identified most of the apps, with 88.79% precision and 98.25% recall.
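The reported figures can be reproduced directly from the counts given above:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# SoProtector counts from the text: TP=2972, FP=375, FN=53
p, r = precision_recall(2972, 375, 53)
print(f"precision={p:.4f} recall={r:.4f}")
```

This yields roughly 0.888 precision and 0.9825 recall, matching the reported 88.79% and 98.25% up to rounding.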


An important class of tested malware is the DroidKungFu app, a type of malware whose core code is in the native layer and which became popular in recent years, especially in China; SoProtector successfully identified its malicious behavior in the dataset. In particular, DuanXinGongJiQi is a recent malware (an SMS Trojan) known to evade most anti-virus products in China. In these cases, SoProtector detected the misbehavior of the outgoing SMS messages, typical of SMS Trojans. These results show that the SoProtector approach is a valid and effective alternative to static taint analysis approaches, being more accurate against malware whose core functions are in the native layer, such as apps not detected by FlowDroid. Incidentally, in Sect. 4.4, due to the randomness of the random forest training process, the results vary between runs; but in general, the accuracy of the combination of the two methods is much higher than either one alone, with a basic accuracy above 72%.

6 Discussion

In this section, we discuss the general applicability of SoProtector, as well as limitations and future work. SoDetection can output the complete path from pollution source to sink across the application layer and the native layer, but its shortcoming lies in the lack of in-depth analysis of how the hooked APIs are implemented inside the Linux kernel, which affects the detection effect to a certain extent; this will be addressed in future research. SoPlatform needs dedicated servers and incurs network transmission costs, which affects its efficiency to some extent.

Acknowledgement. This work has been partially sponsored by the National Key R&D Program of China (No. 2017YFE0111900), the National Science Foundation of China (No. 61572355, U1736115), the Tianjin Research Program of Application Foundation and Advanced Technology (No. 15JCYBJC15700), and the Fundamental Research of Xinjiang Corps (No. 2016AC015).

References
1. Symantec index. http://www.symantec.com/connect/blogs/norton-mobile-insight-discoversfacebook-privacyleak
2. Ball index. http://www.theguardian.com/world/2014/jan/27/nsa-gchqsmartphone-app-angrybirds-personal-data
3. Gibler, C., Crussell, J., Erickson, J., Chen, H.: AndroidLeaks: automatically detecting potential privacy leaks in android applications on a large scale. In: Katzenbeisser, S., Weippl, E., Camp, L.Jean, Volkamer, M., Reiter, M., Zhang, X. (eds.) Trust 2012. LNCS, vol. 7344, pp. 291–307. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30921-2_17
4. Kaspersky index. http://usa.kaspersky.com/about-us/press-center/pressreleases
5. Symantec index. http://www..com/connect/blogs/yet-another-bunchmalicious-apps-foundgoogle-play
6. News index. https://www.csc2.ncsu.edu/faculty/xjiang4/DroidKungFu2/


7. GingerMaster index. https://www.csc2.ncsu.edu/faculty/xjiang4
8. News index. https://blog.lookout.com/blog/2011/03/02/android-malware-droiddream-howit-works/. Accessed 4 Mar 2017
9. Liu, Z.: Verifiable searchable encryption with aggregate keys for data sharing system. Future Gener. Comput. Syst. 78, 778–788 (2018)
10. Enck, W.: TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst., 2–32 (2014)
11. Hornyack, P.: These aren't the droids you are looking for: retrofitting Android to protect data from imperious applications. In: Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 639–652 (2011)
12. Arzt, S.: FlowDroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. ACM SIGPLAN Not. 49, 259–269 (2014)
13. Chen, X.: N-Mobishare: new privacy-preserving location-sharing system for mobile online social networks. Int. J. Comput. Math. 93, 384–400 (2018)
14. Li, T.: CDFS: a cryptographic data publishing system. J. Comput. Syst. Sci., 80–91 (2018)
15. Fischer, F.: Stack Overflow considered harmful? The impact of copy & paste on Android application security. In: IEEE Symposium on Security and Privacy (SP), pp. 121–136 (2017)
16. Xu, D.: Cryptographic function detection in obfuscated binaries via bit-precise symbolic loop mapping. In: IEEE Symposium on Security and Privacy (SP), pp. 921–937 (2017)
17. Eschweiler, S.: Efficient cross-architecture identification of bugs in binary code. In: The Network and Distributed System Security Symposium (2016)
18. Pewny, J.: Cross-architecture bug search in binary executables. In: IEEE Symposium on Security and Privacy, pp. 709–724 (2015)
19. Feng, Q.: Scalable graph-based bug search for firmware images. In: ACM SIGSAC Conference on Computer and Communications Security, pp. 480–491 (2016)
20. Geoffrey, H.: Deep learning. Nature 521, 436–444 (2015)
21. Richard, S.: Recognizing functions in binaries with neural networks. In: USENIX Security, pp. 611–626 (2015)
22. Xiao, J.: Neural network-based graph embedding for cross-platform binary code similarity detection. In: ACM Conference on Computer and Communications Security, pp. 435–446 (2017)
23. Wang, H.: A secure, usable, and transparent middleware for permission managers on Android. IEEE Trans. Dependable Sec. Comput., 350–362 (2017)
24. Wandoujia Store index. http://www.wandoujia.com/apps
25. VirusShare index. https://virusshare.com
26. Krupp, B.: SPE: security and privacy enhancement framework for mobile devices. IEEE Trans. Dependable Sec. Comput. 14, 433–446 (2017)
27. Saracino, A.: MADAM: effective and efficient behavior-based android malware detection and prevention. IEEE Trans. Dependable Sec. Comput. 15, 83–97 (2018)
28. Tongxin, L.: Unleashing the walking dead: understanding cross-app remote infections on mobile WebViews. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 829–844 (2017)
29. Paranthaman, R.: Malware collection and analysis. In: 2017 IEEE International Conference on Information Reuse and Integration, pp. 26–31 (2017)
30. Files Websites index. http://cs.tju.edu.cn/csweb/cyxz

CloudPT: Performance Testing for Identifying and Detecting Bottlenecks in IaaS

Ameen Alkasem(✉), Hongwei Liu, and Decheng Zuo

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
[email protected], {liuhw,zdc}@hit.edu.cn

Abstract. This work addresses performance testing for monitoring mass quantities of large-dataset measurements in Infrastructure-as-a-Service (IaaS). Physical resources are shared, not partitioned, in dynamic clouds; thus, co-located workloads compete for access to system resources. This competition introduces significant new challenges when assessing the performance of IaaS. A bottleneck may occur if one system resource critical to IaaS saturates; this may shut down the system and its services, which would reduce workflow performance by a large margin. To protect against bottlenecks, we propose CloudPT, a performance-test management framework for IaaS. CloudPT has many advantages: (I) high-efficiency detection; (II) a unified end-to-end feedback loop that collaborates with cloud-ecosystem management; and (III) a troubleshooting performance test. This paper shows that CloudPT efficiently identifies and detects bottlenecks with a minimal false-positive rate.

If the measured error level exceeds the threshold (> 3), then the error class is "Serious", while the node fault state is "yes" (fault/abnormal).

Fig. 10. The proposed AFBD algorithm for detecting fault bottlenecks/anomalous behavior

4.3.2 Diagnosis Engine Stage
In this stage, we proposed and implemented algorithms that combine the test data with the training-dataset model to generate an intermediate table and classify the job. This allowed us to concurrently compute the probability of every component in the three classes via Java and Scala implementations of the algorithms, displayed in Appendix A.1, Fig. 16. Our simple dataset outlines a probability model based on the predicted component usage. Here, the measurement level varies widely from 0–100%. We observed the percentage component utilization at discrete times t1,…,tn. The component usage utilized CPU ∈ {0–25%, 26–75%, 76–100%} as thresholds [27]. The new dataset held the outcomes of the model classifier obtained by combining the test and training datasets; it evolves when using the proposed algorithm, as shown in Fig. 17 in Appendix A.1. In the process, we initially defined our task as categorizing the three states using the new model. When one component is faulty, the system cannot work. We employed 0, 1, and 2 to represent the system and component situations. In this case, 0 represented good status (standard working conditions), 1 signified a minor error, and

444

A. Alkasem et al.

Fig. 11. Classification of the probability dataset with final fault states (no or yes)

2 symbolized a serious error. For instance, the CPU, memory, and network signify three simple modules, while the host server state signifies the system state (Fig. 12). The results comprise only three classes (normal, minor, and serious) for the component state and two (yes, no) for the system fault state. To simplify the problem, we chose the same number of normal, minor, and serious measures for all the models (see Fig. 13(A)–(B)) [27].
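The component-state classification described above can be sketched as a tiny categorical naive Bayes classifier; this is only an illustration of the technique, and the training rows and function names are hypothetical, not taken from the paper's dataset:

```python
from collections import Counter

# Hypothetical training rows: (cpu, mem, net) component states -> system fault,
# with 0 = normal, 1 = minor, 2 = serious, as in the paper's encoding.
TRAIN = [
    ((0, 0, 0), "no"), ((0, 1, 0), "no"), ((1, 0, 0), "no"),
    ((2, 1, 1), "yes"), ((2, 2, 0), "yes"), ((1, 2, 2), "yes"),
]

def discretize_cpu(pct):
    """Map raw CPU utilization (%) onto the paper's bins 0-25 / 26-75 / 76-100."""
    return 0 if pct <= 25 else (1 if pct <= 75 else 2)

def nb_predict(evidence, train=TRAIN, alpha=1.0):
    """Categorical naive Bayes with Laplace smoothing over 3-valued features."""
    counts = Counter(label for _, label in train)
    scores = {}
    for label, n in counts.items():
        p = n / len(train)                      # class prior
        for i, v in enumerate(evidence):        # conditionally independent features
            hits = sum(1 for feats, lbl in train if lbl == label and feats[i] == v)
            p *= (hits + alpha) / (n + 3 * alpha)   # 3 possible states per feature
        scores[label] = p
    return max(scores, key=scores.get)
```

On this toy data, an all-normal evidence vector such as (0, 0, 0) is classified "no", while a heavily degraded one such as (2, 2, 1) is classified "yes".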


Fig. 12. An NBC model predicting component utilization for a host VM system

Fig. 13. Results of the proposed algorithms

5 Experimental and Evaluation Results

5.1 OpenStack
OpenStack [28] is open-source software that can control massive amounts of storage, computing, and network resources in a datacenter. Typically, it is managed through a dashboard or the OpenStack API. We introduced 40 irregularities into the OpenStack online service of the host server [29], which resulted in faults/anomalies in global resource consumption (see Fig. 14(A)–(B)). These 40 irregularities represent the extreme failure-source issues one can identify within online services [21].


Additionally, in our experiments, we ran each application independently on Hadoop and Spark for 24 h. We ran their respective benchmarks on 1.2 GB to 12 GB datasets and measured the throughput of the different model implementations, as shown in Fig. 15. We observed patterns of CPU utilization in which the testbed displayed the projected performance for the host server, confirming our hypotheses. We gathered the metrics of the VMs and the host using the fault-troubleshooting method at 4-s intervals. Throughout this period, we injected glitches into the testbed system.

Fig. 14. CPU utilization parameters using Ganglia monitoring metrics

Fig. 15. System throughput of experiments using Hadoop and Spark

5.2 Evaluation and Experiment Results
The statistical measures recall, precision, and accuracy, along with exactness, were used to assess whether the fault diagnosis using Apache Spark and NBC was effective on the massive-dataset problem. As displayed in Table 2, we utilized four statistical measures [26, 30] to evaluate CloudPT's effectiveness in identifying and eradicating bottlenecks. A successful anomaly/fault-bottleneck recognition was defined as the program carefully diagnosing the irregularity using the fault-type identification (type, size, location) together with the affected host VM and metrics. CloudPT is the first end-to-end performance-testing management framework that can troubleshoot, analyze, classify, and suggest repair actions for virtualized cloud-based fault bottlenecks and anomalies.


Table 2. Four statistical measures

Precision: successful detections / total alarms
Recall: successful detections / total anomalies
Accuracy (F1): (2 × precision × recall) / (precision + recall)
False-alarm rate (FAR): 1 − precision
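The four measures of Table 2 can be computed directly from the alarm and anomaly counts; a minimal sketch (the function name is ours):

```python
def detection_metrics(total_alarms, successful, total_anomalies):
    """Table 2's four measures: precision, recall, F1 ("Accuracy"), and FAR."""
    precision = successful / total_alarms        # successful detections / total alarms
    recall = successful / total_anomalies        # successful detections / total anomalies
    f1 = 2 * precision * recall / (precision + recall)
    far = 1 - precision                          # false-alarm rate
    return precision, recall, f1, far
```

For example, 16 successful detections out of 25 alarms against 40 injected anomalies give precision 0.64, recall 0.40, and FAR 0.36.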

Overall, the end-to-end feedback-loop performance of CloudPT diagnosed performance bottlenecks within 20 s. The results showed an 86% improvement in the Accuracy (F1) score compared with the theoretical method, together with a low false-alarm rate. The baseline methods performed as follows:

Method (approach)                                          Alarms  Successful detections  Recall  Precision  Accuracy (F1)  FAR
Static thresholds (> 90%)                                    25            16              0.40     0.64        0.48        0.36
CloudPD (problem determination in shared dynamic clouds)     44            32              0.80     0.72        0.76        0.28

5.3 Performance Testing Overheads
CloudPT uses non-virtualized cloud assets to test for performance bottlenecks and behavioral anomalies. We quantified the overhead of CloudPT in terms of CPU, memory, and network utilization. For our experiments, we created bottlenecks and considered the failure of a host VM to start up. We averaged the virtualized resource usage across VMs over the 24-hour experiment duration; this is represented in Table 3. It is evident that CloudPT imposes minimal overhead on the system. Hence, our experimental study confirms the effectiveness of CloudPT's accuracy and frequency in detecting bottlenecks and anomaly faults, in addition to its low cloud-system overhead.

6 Conclusion

We proposed an Apache Spark-based bottleneck-troubleshooting performance framework for IaaS, called CloudPT. The proposed framework comprises three troubleshooting stages: (I) data collection, (II) analysis- and classification-engine implementation, and (III) decision-engine implementation. The objectives of CloudPT


are to monitor collections, develop analyses, and classify the attributes of measurements, as opposed to individual metric thresholds, by extending fault detection into troubleshooting. In general, the framework focuses on monitoring the shared virtualized resource measurements to address the problems that lead to failure bottlenecks. More specifically, CloudPT troubleshoots all apparent bottlenecks or anomalies by using pre-computed fault notifications. Through this framework, we also measured and modelled CPU utilization, memory usage, and network overhead. Simultaneously, it allows recovery to occur in an automated model that is integrated into cloud management services. We conducted a comprehensive assessment of CloudPT on two representative cloud workloads: Hadoop and Spark. Shortly thereafter, we also conducted a host-VM startup-failure case study. The outcomes of all the experiments demonstrate that CloudPT attains significant accuracy with a low occurrence of false alarms; in short, it efficiently identifies and eradicates bottlenecks and behavioral anomalies. One area of future work will mainly cover the development of additional features for CloudPT, such as recovery and self-healing.

Acknowledgments. We are also thankful to the anonymous reviewers for their valuable feedback and comments, which improved the quality of the manuscript.


Appendix A

A.1 Proposed Algorithms

Fig. 16. Algorithm for training, filtering and streaming the dataset based on Hadoop and Spark


Fig. 17. Algorithm for combining the testing and training datasets: classification and evaluation results

References
1. Malli, S.S., Soundararajan, V., Venkataraman, B.: Real time big data analytics to derive actionable intelligence in enterprise applications. In: Internet of Things and Big Data Analytics Toward Next-Generation Intelligence, pp. 99–121. Springer, Cham (2018)
2. Gregg, B.: Systems Performance: Enterprise and the Cloud. Pearson Education, New Jersey
3. Alsheikh, M.A., Niyato, D., Lin, S., Tan, H.P., Han, Z.: Mobile big data analytics using deep learning and Apache Spark. IEEE Network 30(3), 22–29 (2016)
4. Performance-testing (2017). http://www.softwaretestinghelp.com/what-is-performancetesting-load-testing-stress-testing/
5. Zhang, Q., Cheng, L., Boutaba, R.: Cloud computing: state-of-the-art and research challenges. J. Internet Serv. Appl. 1(1), 7–18 (2010)


6. Alkasem, A., Liu, H., Decheng, Z., et al.: AFDI: a virtualization-based accelerated fault diagnosis innovation for high availability computing. arXiv preprint arXiv:1507.08036 (2015)
7. High CPU utilization but low load average (2017). https://serverfault.com/questions/667078/high-cpu-utilization-but-low-load-average/667089
8. Alkasem, A., Liu, H., Zuo, D.: Utility cloud: a novel approach for diagnosis and self-healing based on the uncertainty in anomalous metrics. In: Proceedings of the 2017 International Conference on Management Engineering, Software Engineering and Service Sciences, pp. 99–107. ACM (2017)
9. Zhai, Y., Xu, W.: Efficient bottleneck detection in stream process systems using a fuzzy logic model. In: Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 438–445. IEEE (2017)
10. Castro Fernandez, R., Migliavacca, M., Kalyvianaki, E., Pietzuch, P.: Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM (2013)
11. Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., et al.: Anomaly-based network intrusion detection: techniques, systems and challenges. Comput. Secur. 28(1), 18–28 (2009)
12. Massie, M., et al.: Monitoring with Ganglia: Tracking Dynamic Host and Application Metrics at Scale. O'Reilly Media, Inc., Massachusetts (2012)
13. Barth, W.N.: System and Network Monitoring. No Starch Press, San Francisco (2008)
14. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016)
15. Sharma, B., Praveen, A., Chita, R.D.: Problem determination and diagnosis in shared dynamic clouds. In: 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE (2013)
16. Cherkasova, L., Ozonat, K., Mi, N., Symons, J., Smirni, E.: Automated anomaly detection and performance modeling of enterprise applications. ACM Trans. Comput. Syst. (TOCS) 27(3), 1–32 (2009)
17. Kumar, A., Shankar, R., Choudhary, A., Thakur, L.S.: A big data MapReduce framework for fault diagnosis in cloud-based manufacturing. Int. J. Prod. Res. 54(23), 7060–7073 (2016)
18. Li, J., Qiu, M., Ming, Z., Quan, G., Qin, X., Gu, Z.: Online optimization for scheduling preemptable tasks on IaaS cloud systems. J. Parallel Distrib. Comput. 72(5), 666–677 (2012)
19. Alkasem, A., Liu, H., Shafiq, M., Zuo, D.: A new theoretical approach: a model construct for fault troubleshooting in cloud computing. Mobile Inf. Syst. 2017, 16 (2017). https://doi.org/10.1155/2017/9038634. Article ID 9038634
20. SivaSelvan, N., Haider, M.Y., Selvan, N.S., Hegde, G.: Design and Development of Performance Management System (2016)
21. Wang, C., Talwar, V., Schwan, K., Ranganathan, P.: Online detection of utility cloud anomalies using metric distributions. In: Network Operations and Management Symposium (NOMS). IEEE (2010)
22. Bertino, E., Catania, B.: Integrating XML and databases. IEEE Internet Comput. 5(4), 84–88 (2001)
23. Barham, P., Boris, D., Keir, F., Steven, H., et al.: Xen and the art of virtualization. In: ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 164–177. ACM (2003)
24. Riddle, A.R., Soon, M.C.: A survey on the security of hypervisors in cloud computing. In: 2015 IEEE 35th International Conference on Distributed Computing Systems Workshops (ICDCSW), pp. 100–104. IEEE (2015)
25. Gelman, A., John, B.C., Hal, S.S., Donald, B.R.: Bayesian Data Analysis, vol. 2. Chapman & Hall/CRC, Boca Raton (2014)


26. Doane, D.P., Lori, E.S.: Applied Statistics in Business and Economics. Irwin, New York (2005)
27. Alkasem, A., Liu, H., Zuo, D., Algarash, B.: Cloud computing: a model construct of real-time monitoring for big dataset analytics using Apache Spark. J. Phys.: Conf. Ser. 933(1), 012018 (2018)
28. Jackson, K.: OpenStack Cloud Computing Cookbook. Packt Publishing Ltd, Birmingham (2012)
29. Kumar, V., Karsten, S.S., Yuan, C., Akhil, S.: A state-space approach to SLA based management. In: Network Operations and Management Symposium NOMS 2008 IEEE, pp. 192–199. IEEE (2008)
30. Alkasem, A., Liu, H.: A survey of fault-tolerance in cloud computing: concepts and practice. Res. J. Appl. Sci. Eng. Technol. 11(12), 1365–1377 (2015)

Smart Grid Power Trading Based on Consortium Blockchain in Internet of Things

Dong Zheng1,2(✉), Kaixin Deng1, Yinghui Zhang1,2(✉), Jiangfan Zhao1, Xiaokun Zheng3, and Xinwei Ma1

1 National Engineering Laboratory for Wireless Security, Xi'an University of Posts and Telecommunications, Xi'an 710121, People's Republic of China
[email protected], [email protected], [email protected], [email protected], [email protected]
2 Westone Cryptologic Research Center, Beijing 100070, China
3 School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, People's Republic of China
[email protected]

Abstract. Internet of Things (IoT) technologies have attracted enormous attention from academia and industry, and one of the most representative applications is the smart grid. Most smart grid system models have to rely on trusted third parties, but there are no trusted third parties in practice. Blockchain technologies show many advantages in IoT due to their unique characteristics. In this paper, to enable reliability, efficiency, flexibility and security in smart grid trading, we combine blockchain technologies, the proof-of-stake consensus mechanism and cryptographic tools to build a novel smart grid power trading system. Our security analysis shows that the proposed system can protect users' data privacy.

Keywords: Smart grid · Blockchain · Smart contracts · Internet of Things · Energy market

1 Introduction

In future smart grid designs, users can use renewable energy sources such as solar and wind to generate storable electricity, reducing power companies' dependence on fossil fuels [9]. Users can complete trades with companies or other users through gateways [20]. It is easy to cause privacy disclosure while

Supported by the National Key R&D Program of China (No. 2017YFB0802000), the National Natural Science Foundation of China (No. 61772418, 61472472, 61402366), and the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JZ6001, 2015JQ6236). Yinghui Zhang is supported by the New Star Team of Xi'an University of Posts and Telecommunications (No. 2016-02).
© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 453–459, 2018. https://doi.org/10.1007/978-3-030-05057-3_34

454

D. Zheng et al.

the information is not encrypted. In recent years, many technologies have been used to protect Internet of Things (IoT) security [6,8,12,15]. Electric power companies acting as trusted third parties are vulnerable to attacks, and some security issues remain [1,12]. In view of these security threats, it is urgent to design a safe and reliable decentralized system to ensure that the interests of users and companies are not violated.

Blockchain is defined as a distributed database that records transactions of value using a cryptographic signature and that is inherently resistant to modification [11]. It enables a distributed peer-to-peer network in which non-trusting members can interact with each other without a trusted intermediary [2]. The famous WannaCry blackmail virus used Bitcoin as its payment currency [3], which made more and more people aware of the uniqueness of the blockchain. With the development of cloud computing [16,19] and wireless network technologies [13,14], blockchain technologies have been used in outsourcing services for payment [18] and keyword search [17] in cloud computing and for intelligent control in energy markets [4,10], and consortium blockchains have high potential to establish decentralized electricity trading systems at moderate cost [7]. However, most existing schemes adopt proof-of-work (PoW) consensus mechanisms or private chains, where PoW is wasteful and private chains are not decentralized in essence. In future smart grid systems, constructing a decentralized peer-to-peer network using blockchain technology can bring the system more security and flexibility.

This paper presents a smart grid power trading system. The main contributions of this paper are two-fold.

– For one thing, we adopt the proof-of-stake (PoS) consensus mechanism instead of PoW to present a new architecture for smart grid power trading. Our architecture overcomes the shortcomings of the 51% attack, which is the most common attack method against blockchains.
– For another, to address the security problems of users uploading data to the authorized nodes, we use cryptography to encrypt the data collected by the sensors.

Organization. The remainder of this paper is organized as follows. In Sect. 2, we describe the proposed system in detail together with the proposed system architecture. In Sect. 3, we give the security analysis of the proposed system. We draw our conclusions in Sect. 4.

2 Smart Grid Power Trading System

2.1 System Architecture

Figure 1 presents an overview of the model, which uses blockchain as the protocol layer, relies on Ethereum to run the smart contracts, uses the PoS consensus mechanism, and completes power trading with the help of the market. The system has the following levels:

Smart Grid Power Trading Based on Consortium Blockchain in IoT

455

Fig. 1. System model

(1) User layer. Users register in the system through smart meters [5] with their true identity; the system returns a public key and a certificate, and the certificate uniquely identifies the user node by binding the user's registration information. Users then generate their own public and private keys. Users can price electrical energy through interactive devices, setting the expected sales price and the amount of electricity awaiting smart-contract processing.
(2) Authorized-nodes layer. The system establishes authorized nodes based on the users' geographical distribution so that users can participate in the system. Consensus is reached between authorized nodes according to user needs. When a user's transaction requirements are satisfied, the smart contract is automatically executed.
(3) Power-company layer. In the system, power companies play the role of balancing energy storage. They perform big-data analysis and forecasting according to the regional electricity situation, estimate the current regional peak time, and send request information through contracts during low-demand periods for low-cost electricity purchases. Power companies are also important carriers for transmitting power over long distances.
(4) PoS consensus mechanism. We use the PoS mechanism instead of PoW; it releases interest and generates blocks based on the energy converted from renewable sources and the time it has been stored. If a region is rich in electricity resources, the PoS mechanism compensates areas that empty more of their stored power with interest; the authorized node charges a part of the commission by agreement and distributes most of the interest to users. Users are thus more willing to sell electricity at a price below the market price.


2.2 System Details

System Initialization. A smart meter first needs to register through an authorized node to participate in the system and become a legal node. When a new user User_i^u joins the system, it obtains the certificate Cert_i^u and the node public key PK_u (used to encrypt the sensing data) from the authorized node Node_u, and generates its own public and private keys {PK_i^u, SK_i^u}, where u denotes the u-th community and i the i-th user in u. The new user's smart meter downloads the current system's block-data storage-location index table from the authorized node's records, after which the synchronization list may be obtained from nearby smart meters through the P2P network. The process is expressed as follows:

Node_u → User_i^u : {PK_u, Cert_i^u}    (1)

Authorized Data Upload. The user's or power company's smart meter senses the electrical energy converted from renewable sources in the energy-storage unit and collects the sensory data data_i^u. The user sets the price for selling or purchasing electricity on the interactive device according to the current market price p_i^u and sets the amount of electricity to sell or buy x_i^u; the smart meter packages the data, encrypts and signs it, and passes it to the authorized node. Data are uploaded using pseudonyms and digital signatures to ensure their integrity and authenticity. The process is expressed as follows:

User_i^u → Node_u : data_ipx = Enc_{PK_u}(data_i^u || Cert_i^u || Sig_i^u || timestamp)    (2)

Among them:

data_i^u = Enc_{PK_i^u}{data || p_i^u || x_i^u}    (3)
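As a rough illustration of the packaging in Eqs. (2)-(3), the following sketch bundles data, price and amount with a certificate and timestamp and authenticates the bundle. An HMAC stands in for the asymmetric Enc/Sig primitives of the scheme, and all keys and field names here are hypothetical:

```python
import hashlib
import hmac
import json
import time

USER_KEY = b"user-i-signing-secret"   # stand-in for SK_i^u (illustrative only)

def pack_upload(data, price, amount, cert="Cert_i^u", ts=None):
    """Build the data_ipx bundle of Eq. (2): inner payload, certificate, timestamp, tag."""
    inner = json.dumps({"data": data, "p": price, "x": amount})   # data_i^u, Eq. (3)
    msg = json.dumps({"inner": inner, "cert": cert, "ts": ts or time.time()})
    tag = hmac.new(USER_KEY, msg.encode(), hashlib.sha256).hexdigest()
    # In the real scheme, (msg, tag) would additionally be encrypted under PK_u.
    return msg, tag

def node_validate(msg, tag):
    """Authorized-node validation: recompute the tag and compare in constant time."""
    expected = hmac.new(USER_KEY, msg.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)
```

Any tampering with the packaged message invalidates the tag, which mirrors the integrity and authenticity guarantees the paper attributes to the digital signature.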

Authorized Node Validation. When Node_u receives the data, it uses its private key SK_u to decrypt data_ipx and verify the user's identity. If the information is valid, it is saved in the data-record pool for the next processing step; if the information is insecure or invalid, the data is ignored.

PoS Consensus Mechanism Operation. The PoS consensus mechanism has a unique concept, coindays: the amount of currency multiplied by the number of days it has been held. The authorized node is responsible for running the smart contract and generating blocks. Total electricity is used to generate the coindays for block generation: as long as a node holds electricity, it can mine data blocks regardless of the amount, and since no mining pools are used, computing power does not concentrate. At the same time, resource consumption is reduced because coindays, rather than computing power, are used to generate blocks. If a new PoS block is discovered by an authorized node, its coindays are cleared in exchange for a payment that compensates the users and the authorized node.
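The coinday accounting described above can be sketched in a few lines; the node names, stake figures, and helper names are all hypothetical:

```python
def coindays(units, days_held):
    """Coin age: stored electricity units multiplied by the days they were held."""
    return units * days_held

def pick_block_producer(stakes):
    """stakes: {node: (units, days_held)}; the largest accumulated coin age produces."""
    return max(stakes, key=lambda node: coindays(*stakes[node]))

def main_chain(chains):
    """Fork choice (cf. Sect. 3): each chain is a list of per-block consumed
    coindays, and the chain with the highest total wins."""
    return max(chains, key=lambda blocks: sum(blocks))
```

For example, a node holding 50 units for 5 days (250 coindays) outranks one holding 100 units for 2 days (200 coindays), regardless of raw computing power.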

Smart Grid Power Trading Based on Consortium Blockchain in IoT

457

If an authorized node Node_u consumes coindays and generates a new data block, the node integrates the data sets data_t^u received from other nodes, attaches its signature Sig_u and the hash value of the new data block, and broadcasts it to the other authorized nodes. The process is expressed as follows:

data_t^u = {data_px1^u || data_px2^u || · · · || data_pxn^u || timestamp}    (4)

Node_u → All Nodes : (data_t^u || data_hash || Cert_u || Sig_u || timestamp)    (5)

Among them:

data_hash = Hash(data_t^u || timestamp)    (6)

Reply. After the other authorized nodes receive the node's broadcast, they verify the legitimacy and correctness of the data block through the block hash and digital signature, and broadcast the audit result, with their signatures, to the other authorized nodes Node_l. After Node_u receives and summarizes all the audit results, it signs and sends a reply to the master node.

Node_l → Node_u : reply = Enc_{PK_u}(result_sets || Cert_l || Sig_l || timestamp)    (7)

Among them:

result_sets = {result_1 || result_2 || · · · || result_l}    (8)

Writing in the Chain. The authorized node Node_u can decrypt the replies with its own private key SK_u after receiving them from the other authorized nodes. If the audit results of the other nodes pass, Node_u puts the audit result into the data block and writes it into the main chain, thereby earning the system reward.

Contract Operation. There is a virtual trading market in our model. The power sales and quotes packaged by the authorized nodes are broadcast over the entire network, and smart contracts are automatically executed according to the needs of users and power companies by running scripts, for example, searching electricity prices from low to high and buying at a lower price according to the user's estimated price; power companies can buy electricity in off-peak periods and sell it in peak periods according to the smart contract.

Power Transfer. When the smart contract completes, the smart meter conducts electricity dispatching according to the data broadcast by the authorized nodes, and the corresponding digital currency is obtained or paid.

3 Security Analysis

The system proposed in this paper utilizes asymmetric encryption technology and has good resistance to traditional security attacks. Through the


cryptographic authentication mechanism, an attacker cannot crack the encrypted information within the effective time; by adding a timestamp to the data, attackers cannot launch replay attacks; and by using digital signatures on the data, attackers are prevented from forging fake data or tampering with data. As for blockchain security, our system does not require a trusted third party. The data is backed up at each authorized node, so a small number of compromised nodes cannot cause the collapse of the entire system. The system uses pseudonyms to protect the privacy of users' personal information, so that nodes cannot obtain their true identities. We use smart contracts to share data, restrict data-access rights, and make transactions transparent. The PoS mechanism submits the consumed coindays with each block in order to increase the block's score; the chain with the highest consumed coindays is selected as the main chain. Under PoW, someone with more than 50% of the computing power can mine blocks faster than everyone else and thus effectively controls the chain, for example by undoing payment transactions. Our design reduces the worry of the 51% attack, because in the PoS consensus mechanism a 51% attack requires controlling a large number of coins, whose cost may be higher than that of 51% of the computing power; the attack therefore becomes more expensive.
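The timestamp defense mentioned above amounts to a freshness-window check plus a seen-set for duplicates; a minimal sketch with an illustrative 30-second window (the window size and function names are our assumptions, not from the paper):

```python
import time

FRESHNESS_WINDOW = 30.0   # seconds; illustrative, not from the paper
_seen_ids = set()

def accept_message(msg_ts, msg_id, now=None):
    """Reject messages that are stale (replayed old traffic) or already seen."""
    now = time.time() if now is None else now
    if now - msg_ts > FRESHNESS_WINDOW:
        return False          # too old: likely a replayed capture
    if msg_id in _seen_ids:
        return False          # duplicate within the window: replay
    _seen_ids.add(msg_id)
    return True
```

A captured message replayed after the window, or re-sent within it, is rejected either way, which is the property the security analysis relies on.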

4 Conclusion

With the rapid development of smart grid systems, centralized data storage methods find it increasingly difficult to withstand attacks; turning centralized storage into distributed storage is the future trend. This paper proposes a smart grid trading system based on a consortium blockchain that relies on smart contracts and the PoS consensus mechanism, enabling users and the operators who maintain the nodes to form a win-win situation. To further improve the security of the system, we are considering improving the consensus algorithm based on PoS to ensure that the verifier with the highest-value deposit for each block can operate the blockchain in the best-profit model.

References

1. Aitzhan, N.Z., Svetinovic, D.: Security and privacy in decentralized energy trading through multi-signatures, blockchain and anonymous messaging streams. IEEE Trans. Dependable Sec. Comput. (2016). https://doi.org/10.1109/TDSC.2016.2616861
2. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016)
3. Crowe, J.: WannaCry ransomware statistics: the numbers behind the outbreak. https://blog.barkly.com/wannacry-ransomeware-statistics-2017/
4. Etemad, R.H., Lahouti, F.: Resilient decentralized consensus-based state estimation for smart grid in presence of false data. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3466–3470. IEEE (2016)

Smart Grid Power Trading Based on Consortium Blockchain in IoT    459

5. Han, Q., Zhang, Y., Chen, X., Li, H., Quan, J.: Efficient and robust identity-based handoff authentication in wireless networks. In: Xu, L., Bertino, E., Mu, Y. (eds.) NSS 2012. LNCS, vol. 7645, pp. 180–191. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34601-9_14
6. Li, J., Zhang, Y., Chen, X., Xiang, Y.: Secure attribute-based data sharing for resource-limited users in cloud computing. Comput. Secur. 72, 1–12 (2018)
7. Li, Z., Kang, J., Yu, R., Ye, D., Deng, Q., Zhang, Y.: Consortium blockchain for secure energy trading in industrial internet of things. IEEE Trans. Ind. Inform. (2017). https://doi.org/10.1109/TII.2017.2786307
8. Liu, Y., Zhang, Y., Ling, J., Liu, Z.: Secure and fine-grained access control on e-healthcare records in mobile cloud computing. Future Gener. Comput. Syst. 78, 1020–1026 (2018)
9. Mahmoud, M.M., Saputro, N., Akula, P.K., Akkaya, K.: Privacy-preserving power injection over a hybrid AMI/LTE smart grid network. IEEE Internet Things J. 4(4), 870–880 (2017)
10. Mannaro, K., Pinna, A., Marchesi, M.: Crypto-trading: blockchain-oriented energy market. In: AEIT International Annual Conference, pp. 1–5. IEEE (2017)
11. Mylrea, M., Gourisetti, S.N.G.: Blockchain for smart grid resilience: exchanging distributed energy at speed, scale and security. In: Resilience Week (RWS), pp. 18–23 (2017)
12. Zhang, Y., Zheng, D., Deng, R.H.: Security and privacy in smart health: efficient policy-hiding attribute-based access control. IEEE Internet Things J. 5(3), 2130–2145 (2018)
13. Zhang, Y., Chen, X., Li, H., Cao, J.: Identity-based construction for secure and efficient handoff authentication schemes in wireless networks. Secur. Commun. Netw. 5(10), 1121–1130 (2012)
14. Zhang, Y., Chen, X., Li, J., Li, H.: Generic construction for secure and efficient handoff authentication schemes in EAP-based wireless networks. Comput. Netw. 75, 192–211 (2014)
15. Zhang, Y., Chen, X., Li, J., Li, H., Li, F.: FDR-ABE: attribute-based encryption with flexible and direct revocation. In: International Conference on Intelligent Networking and Collaborative Systems (INCoS), pp. 38–45. IEEE (2013)
16. Zhang, Y., Chen, X., Li, J., Wong, D.S., Li, H.: Anonymous attribute-based encryption supporting efficient decryption test. In: Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pp. 511–516. ACM (2013)
17. Zhang, Y., Deng, R.H., Jiangang, S., Kan, Y., Dong, Z.: TKSE: trustworthy keyword search over encrypted data with two-side verifiability via blockchain. IEEE Access 6, 31077–31087 (2018)
18. Zhang, Y., Deng, R.H., Ximeng, L., Dong, Z.: Blockchain based efficient and robust fair payment for outsourcing services in cloud computing. Inf. Sci. 462, 262–277 (2018)
19. Zhang, Y., Li, J., Chen, X., Li, H.: Anonymous attribute-based proxy re-encryption for access control in cloud computing. Secur. Commun. Netw. 9(14), 2397–2411 (2016)
20. Zhang, Y., Zhao, J., Zheng, D.: Efficient and privacy-aware power injection over AMI and smart grid slice in future 5G networks. Mob. Inf. Syst. 2017, 1–11 (2017)

Energy-Efficient Offloading in Mobile Edge Computing with Edge-Cloud Collaboration

Xin Long, Jigang Wu(B), and Long Chen

School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
[email protected], [email protected], [email protected]

Abstract. Multiple-access mobile edge computing is an emerging technique that brings computation resources close to end mobile users. By deploying edge servers at WiFi access points or cellular base stations, the computation capabilities of mobile users can be extended. Existing works mostly assume that the remote cloud server can be viewed as a special edge server or that the edge servers are willing to cooperate, which is not practical. In this work, we propose an edge-cloud cooperative architecture in which edge servers can rent remote cloud servers to expedite the computation of tasks from mobile users. Under this architecture, the computation offloading problem is modeled as a mixed integer program with delay constraints, which is NP-hard. The objective is to minimize the total energy consumption of mobile devices. We propose a greedy algorithm with an approximation ratio of (1 + ε), as well as a simulated annealing algorithm, to solve the problem effectively. Extensive simulation results demonstrate that the proposed greedy algorithm can achieve the same application completion time budget performance as the Brute Force optimal algorithm with only 31% extra energy cost.

Keywords: Mobile edge computing · Cooperate · Greedy algorithm · Remote cloud · Task dependency

1 Introduction

The recent tremendous growth of diverse wireless devices and applications has brought new challenges to wireless systems. With the proliferation of smart mobile devices and wearable sensors, mobile traffic and computation tasks have increased dramatically. Cloud computing [2] as well as 5G communication [5,9] has therefore been proposed to deal with this challenge in the big data era. Despite its potential in data storage and analysis, cloud computing cannot fulfill growing application requirements such as low latency and context awareness. Multiple-access mobile Edge Computing (MEC) [13], which serves as a complement to cloud computing, can potentially overcome the weakness of

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 460–475, 2018. https://doi.org/10.1007/978-3-030-05057-3_35


mobile cloud computing by offloading computation-intensive tasks to the edge of wireless networks. Task allocation and computation resource assignment are crucial to MEC, especially for applications with a large number of delay-sensitive subtasks, for example, online gaming for recreation or face recognition for security purposes. Such tasks should be handled in time, taking the finite bandwidth and limited computation resources into consideration. Offloading problems that jointly consider the above factors are usually mixed integer programming problems, which are non-convex and NP-hard [7,8]. Among task allocation and resource assignment schemes, energy optimization is one of the key factors affecting the performance of computation-resource-limited mobile devices, because the energy consumption of a mobile device grows sharply when multiple complex tasks run on it. Earlier works on energy optimization for MEC, such as [3,17], assumed an unlimited energy supply at edge servers. Bi et al. [3] addressed the computation rate maximization problem in wireless powered MEC networks, where mobile devices can harvest energy from a cellular base station equipped with an MEC server; the original problem was non-convex, and a decoupled optimization with the coordinate descent method was proposed to solve it. Lyu et al. [17] studied the total energy consumption of multiple devices with latency constraints; the problem was modeled as a mixed-integer program, followed by a dynamic programming algorithm based on the Bellman equation. More recent research [4,14] has focused on delay minimization with energy or budget constraints of edge servers. Chen et al. [4] proposed a novel multi-cell MEC architecture in which edge devices such as base stations can cooperate with a remote server on task execution. Considering the ON/OFF nature of edge servers, they used the Lyapunov optimization technique to obtain optimal task offloading decisions. Considering task dependency, Kao et al. [14] presented Hermes, aiming at minimizing the total execution time of tasks under user budget constraints (Table 1).

Table 1. Comparison between existing works and this work.

| Existing works             | [3]              | [4]   | Hermes [14] | [17]   | This work |
|----------------------------|------------------|-------|-------------|--------|-----------|
| Task dependency            | No               | No    | Yes         | No     | Yes       |
| Edge-cloud collaboration   | No               | No    | No          | No     | Yes       |
| Energy constraint of users | No               | Yes   | Yes         | No     | Yes       |
| Server utility constraint  | No               | No    | No          | No     | Yes       |
| Objective                  | Computation rate | Delay | Delay       | Energy | Energy    |

Based on the literature review, task dependency was not properly investigated by [3,4,17], although it is important for real deployment. Task dependency was used in the model of [14], but the authors neglected the influence of remote cloud servers. Moreover, all the above works assume the remote cloud server can be viewed as a special edge server, or that the edge servers are willing

to cooperate. In real scenarios, the remote cloud server has higher computation capability than the edge server, and the transmission delay between the edge cloud and the remote server cannot be neglected when designing proper offloading schemes. Take face recognition as an example: the feature extraction tasks for face images obtained by individual mobile devices can be offloaded to edge servers, while the machine learning and face recognition, i.e., image matching tasks, can be executed on the remote cloud servers. Therefore, with edge-cloud cooperation, target faces can be detected within a certain bounded delay for distributed mobile devices.

In this work, we investigate the computation offloading decision and resource allocation problem under given delay requirements of mobile applications. The objective is to minimize the sum energy consumption of mobile devices. Different from the above works, we take edge-cloud cooperation into account, which brings new challenges to the energy optimization problem. Since the network resources are heterogeneous, it is necessary to determine whether each computation task should be executed at the remote cloud, processed at an edge server, or run on the local mobile device. From the perspective of edge and remote cloud servers, their service for mobile devices should be compensated for the cost of execution, and their profits should be guaranteed. Since the tasks of one application are delay bounded, edge-cloud cooperation under user budget constraints should be carefully designed. The main contributions of this paper can be summarized as follows:

– A novel edge-cloud cooperation architecture is proposed for wireless heterogeneous networks, with edge servers deployed at small-cell base stations and remote cloud servers connected to the macro-cell base station. The edge server can hire remote cloud servers to process some of the tasks originating from mobile devices.
– The offloading problem is modeled as a mixed integer non-linear program, which is NP-hard. We then propose a greedy algorithm as well as a simulated annealing algorithm to solve the problem effectively.
– To provide incentives for edge servers, we propose a pricing scheme with virtual currency paid by mobile users to edge servers and remote cloud servers for their dedication to serving mobile users.

The remainder of the paper is organized as follows. The system model and computation model are presented in Sect. 2. Section 3 presents the problem formulation. The proposed algorithms are described in Sect. 4. Section 5 presents the performance evaluation. Section 6 concludes this paper with future remarks.

2 System Model and Computation Model

This section first describes the system model and then formulates the offloading problem for energy saving with local computing, edge computing, and collaboration between edge and cloud servers.


Fig. 1. System architecture

2.1 System Model

As shown in Fig. 1, each edge server is located at an access point (AP) [6] to which multiple mobile devices are attached. The edge server is deployed at the AP and is linked to the remote cloud via high-speed fiber links. Let U be the set of mobile devices; we assume that there are M mobile devices, so U = {u_1, u_2, u_3, ..., u_M}, where M ≥ 1. Meanwhile, there is a set T_m of subtasks on the m-th mobile device, which can be denoted as T_m = {τ_{m,1}, τ_{m,2}, τ_{m,3}, ..., τ_{m,N}}, where N ≥ 0. Next, we introduce the communication and computation models for mobile devices, edge servers, and the remote cloud in detail.

2.2 Communication Model

Transmission Between Mobile Devices and Edge. Let X^m_{m,n} ∈ {0, 1}, X^f_{m,n} ∈ {0, 1}, and X^c_{m,n} ∈ {0, 1} represent the computation offloading policy made by the m-th mobile device. Particularly, X^m_{m,n} = 1 denotes that subtask n on mobile device m is executed locally, X^f_{m,n} = 1 denotes that subtask n of mobile device m is executed on the edge server, and X^c_{m,n} = 1 denotes that subtask n of mobile device m is executed on the remote cloud. The uplink data rate for wireless transmission between a mobile device and the edge server can be computed as [5]:

R_{m,n} = W log_2 ( 1 + P^m_{m,n} G_{m,n} / ( σ^2_m + Σ_{i≠m, j≠n} P^m_{i,j} G_{i,j} ) ),   (1)

where P^m_{m,n} is the transmission power of mobile device m to upload subtask n to the edge server via the AP, and G_{m,n} is the channel gain between the m-th mobile device and the corresponding AP when transmitting subtask n. G_{m,n} = dis^{-η}_{m,p} |h_{m,p}|^2, where dis_{m,p} denotes the Euclidean distance between the mobile device and the edge server, and h_{m,p} is the corresponding Rayleigh fading channel coefficient that obeys the distribution N(0, 1) [20]. The surrounding noise power at the receiver, i.e., the AP, is σ^2_m [20].

It should be noted that, for ease of presentation, the downlink transmission rate is represented by the corresponding uplink rate; likewise, we use the expression for the uplink transmission delay to represent the downlink transmission delay. This is because the downlink transmission rate is usually a few times larger than the uplink rate due to the channel allocation of the network operator. With this simplification, we reduce the complexity of the delay and energy cost expressions described in the following paragraphs (Table 2).

Table 2. Basic notations

| Notation     | Description |
|--------------|-------------|
| M            | Number of mobile devices |
| N            | Number of subtasks |
| D_{m,n}      | Data size of subtask n on mobile device m |
| W_{m,n}      | Workload of subtask n on mobile device m |
| R_{m,n}      | Uplink data rate for subtask n of mobile device m |
| t^t_{m,n}    | Time spent sending subtask n of device m to the edge server |
| t^r_{m,n}    | Time spent sending subtask n of device m from the edge server to the remote cloud |
| E^t_{m,n}    | Energy cost of transmission between mobile device and edge server for subtask n of device m |
| E^r_{m,n}    | Energy cost of transmission between edge and cloud for subtask n of mobile device m |
| t^l_{m,n}    | Delay when executing subtask n locally |
| E^l_{m,n}    | Energy consumption when executing subtask n of device m locally |
| TF^l_{m,n}   | Completion time of subtask n on mobile device m when executed locally |
| EF^l_{m,n}   | Energy cost during the completion of subtask n on the device with local computing |
| Budget_m     | Budget, i.e., allowed delay threshold for subtasks on device m |
| E_m          | Total energy cost for all subtasks of device m |
| TF_m         | Total time consumed by all subtasks of mobile device m |
| U^f_p        | Profit of the edge server |
| X^m_{m,n}    | Offloading policy for subtask n of device m: local computing |
| X^f_{m,n}    | Offloading policy for subtask n of device m: edge computing |
| X^c_{m,n}    | Offloading policy for subtask n of device m: remote execution |


The transmission delay of subtask n between mobile device m and the corresponding edge server is thus [11]

t^t_{m,n} = D_{m,n} / R_{m,n},   (2)

where t^t_{m,n} represents the time spent sending subtask n of mobile device m to the edge server, and D_{m,n} is the data size of subtask n of device m. Based on the above equations, the energy consumption when transmitting subtask n of mobile device m to the edge server is

E^t_{m,n} = P^m_{m,n} t^t_{m,n},   (3)

where P^m_{m,n} is the transmission power of mobile device m when sending subtask n.

Transmission Between Edge and Cloud. Since the edge server links to the remote cloud via a wired connection, the delay of data transmission from the edge server to the cloud is

t^r_{m,n} = D_{m,n} / ω,   (4)

where t^r_{m,n} denotes the transmission delay for subtask n of mobile device m from the edge server to the cloud, and ω denotes the upstream bandwidth. Given the transmission delay t^r_{m,n} between edge and remote cloud and the transmission power P_0, E^r_{m,n} can be expressed as

E^r_{m,n} = P_0 t^r_{m,n},   (5)

where E^r_{m,n} is the energy consumed when sending subtask n of mobile device m from the edge to the cloud.
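Equations (1)-(5) can be sketched directly in code. The snippet below is a minimal illustration with placeholder parameter values (the powers, gains, and sizes are assumptions, not the paper's simulation settings); interference is taken as zero for simplicity.

```python
import math

# Hedged sketch of Eqs. (1)-(5); symbol names follow the text, values are
# placeholders chosen only to exercise the formulas.

def uplink_rate(W, P, G, noise, interference=0.0):
    """Eq. (1): R = W * log2(1 + P*G / (sigma^2 + interference))."""
    return W * math.log2(1 + P * G / (noise + interference))

def tx_delay(D, R):
    """Eq. (2) and Eq. (4): delay = data size / rate (rate = omega for Eq. (4))."""
    return D / R

def tx_energy(P, t):
    """Eq. (3) and Eq. (5): energy = transmit power * transmission time."""
    return P * t

R = uplink_rate(W=2e6, P=0.1, G=1e-3, noise=1.0)  # bit/s, Eq. (1)
t_t = tx_delay(D=1e6, R=R)        # device -> edge, Eq. (2)
E_t = tx_energy(P=0.1, t=t_t)     # Eq. (3)
t_r = tx_delay(D=1e6, R=1024e6)   # edge -> cloud over bandwidth omega, Eq. (4)
E_r = tx_energy(P=1.0, t=t_r)     # Eq. (5), with P0 = 1 W assumed
```

A low channel gain makes the uplink the bottleneck here, which matches the paper's later choice to represent downlink delays by their (larger) uplink counterparts.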

2.3 Computation Model

Computation on Local Device. Let f^l_m be the CPU clock speed of mobile device m and W_{m,n} be the workload of subtask n of mobile device m. If subtask n on mobile device m is executed locally, then the subtask's execution time is

t^l_{m,n} = W_{m,n} / f^l_m.   (6)

Given the computation time t^l_{m,n}, the energy consumed by subtask n of mobile device m under local computing is

E^l_{m,n} = k W_{m,n} (f^l_m)^2.   (7)

By default, k is set to 10^{-11}, following [12].


Computation on Edge. Let f^f be the CPU frequency of the edge server. If subtask n of mobile device m is executed on the edge server, the computation time of the edge server is

t^f_{m,n} = W_{m,n} / f^f,   (8)

and the energy cost of the edge server can be expressed as

E^f_{m,n} = (α_f (f^f)^σ + β_f) t^f_{m,n}.   (9)

According to [19], α_f and β_f are positive constants that can be obtained by offline power fitting, and σ ranges from 2.5 to 3. If subtask n of mobile device m is executed on the cloud, the computation delay and energy cost of the remote cloud are, respectively,

t^c_{m,n} = W_{m,n} / f^c,   (10)

and

E^c_{m,n} = (α_c (f^c)^σ + β_c) t^c_{m,n}.   (11)
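The three per-site delay/energy models of Eqs. (6)-(11) can be sketched as follows. The constants k, α, β, and σ are placeholders drawn from the ranges quoted from [12, 19]; this is an illustration, not the paper's code.

```python
# Hedged sketch of the computation model, Eqs. (6)-(11).

def local_delay_energy(W_task, f_local, k=1e-11):
    """Local execution: Eq. (6) delay and Eq. (7) energy."""
    t = W_task / f_local                      # Eq. (6)
    E = k * W_task * f_local ** 2             # Eq. (7)
    return t, E

def server_delay_energy(W_task, f_srv, alpha, beta, sigma=3.0):
    """Edge/cloud execution: Eqs. (8)-(9) or Eqs. (10)-(11)."""
    t = W_task / f_srv                        # Eq. (8) / Eq. (10)
    E = (alpha * f_srv ** sigma + beta) * t   # Eq. (9) / Eq. (11)
    return t, E

# Example: a 1e8-cycle subtask on a 5e6 Hz device vs. a 2e9 Hz edge server.
t_l, E_l = local_delay_energy(1e8, 5e6)
t_f, E_f = server_delay_energy(1e8, 2e9, alpha=0.1, beta=0.1)
assert t_f < t_l  # the faster edge server finishes the subtask far sooner
```

The quadratic (device) versus σ-power (server) frequency terms are what drive the trade-off the offloading decision must balance.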

2.4 Dependency Constraints

Definition 1 (Subtask completion time). Subtask n of mobile device m can only start when all its predecessor subtasks have been completed. The completion time of the n-th subtask of mobile device m consists of two parts: the time spent obtaining the results of all its predecessor tasks and the time spent on its own computation.

Definition 2 (Energy cost to accomplish one subtask). It also consists of two parts: the energy spent obtaining the results of predecessor tasks and the energy spent on its own execution.

Based on the above definitions, if subtask n of mobile device m is assigned to be executed locally, its completion time can be expressed as

TF^l_{m,n} = max_{k∈pre(n)} ( X^f_{m,k} t^t_{m,n} + X^c_{m,k} (t^t_{m,n} + t^r_{m,n}) ) + t^l_{m,n},   (12)

and the energy cost for local completion is

EF^l_{m,n} = Σ_{k∈pre(n)} ( X^f_{m,k} E^t_{m,n} + X^c_{m,k} (E^t_{m,n} + E^r_{m,n}) ) + E^l_{m,n}.   (13)

In (12) and (13), X^f_{m,k} X^c_{m,k} = 0. The notation pre(n) in (12) means all the predecessor subtasks of the n-th subtask. In (12), the term X^f_{m,k} t^t_{m,n} is the delay to obtain a predecessor subtask's result if that predecessor is executed on the edge server. Similarly, the term X^c_{m,k} (t^t_{m,n} + t^r_{m,n}) is the delay to obtain the result if the predecessor subtask of n is executed on the cloud server.

If subtask n of mobile device m is assigned to be executed on the edge server, the completion time of subtask n can be defined as

TF^f_{m,n} = max_{k∈pre(n)} ( X^m_{m,k} t^t_{m,n} + X^c_{m,k} t^r_{m,n} ) + t^f_{m,n},   (14)

where X^m_{m,k} is the predecessor subtask's assignment on the mobile device: X^m_{m,k} = 1 means the k-th subtask is computed on the local mobile device, and X^m_{m,k} = 0 otherwise. The term X^m_{m,k} t^t_{m,n} is the delay to transmit the result of the predecessor task from the mobile device to the edge server, while X^c_{m,k} t^r_{m,n} is the delay to send the prior result from the remote cloud to the edge server.

Let EF^f_{m,n} be the energy cost of subtask n of device m executed on the edge server; similarly to (13), it can be defined as

EF^f_{m,n} = Σ_{k∈pre(n)} ( X^m_{m,k} E^t_{m,n} + X^c_{m,k} E^r_{m,n} ) + E^f_{m,n}.   (15)

Similarly to (12) and (14), if subtask n of mobile device m is assigned to be executed on the remote cloud, its completion time can be expressed as

TF^c_{m,n} = max_{k∈pre(n)} ( X^m_{m,k} (t^t_{m,n} + t^r_{m,n}) + X^f_{m,k} t^r_{m,n} ) + t^c_{m,n},   (16)

and the corresponding energy cost to complete the subtask on the remote cloud, EF^c_{m,n}, is

EF^c_{m,n} = Σ_{k∈pre(n)} ( X^m_{m,k} (E^t_{m,n} + E^r_{m,n}) + X^f_{m,k} E^r_{m,n} ) + E^c_{m,n}.   (17)
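The completion-time logic of Eqs. (12), (14), and (16) reduces to "max fetch delay over predecessors, plus own execution time", where the fetch delay depends on the placement pair. A minimal sketch, assuming the per-subtask delays t^t, t^r, t^l, t^f, t^c are precomputed as in Sects. 2.2-2.3 (values below are illustrative):

```python
# Hedged sketch of Eqs. (12), (14), (16): completion time of subtask n
# under each placement, given the placements of its predecessors.

def completion_time(place_n, pred_places, t_t, t_r, t_l, t_f, t_c):
    """place_n and each entry of pred_places are 'local', 'edge', or 'cloud'."""
    def fetch(pred):
        # Delay to move a predecessor's result to subtask n's site.
        if place_n == 'local':
            return {'local': 0, 'edge': t_t, 'cloud': t_t + t_r}[pred]
        if place_n == 'edge':
            return {'local': t_t, 'edge': 0, 'cloud': t_r}[pred]
        return {'local': t_t + t_r, 'edge': t_r, 'cloud': 0}[pred]

    exec_time = {'local': t_l, 'edge': t_f, 'cloud': t_c}[place_n]
    return max((fetch(p) for p in pred_places), default=0) + exec_time

# A cloud-placed subtask whose predecessors ran locally and on the edge:
tf = completion_time('cloud', ['local', 'edge'], t_t=2, t_r=1, t_l=20, t_f=5, t_c=3)
assert tf == 6  # max(2 + 1, 1) transfer delay plus t_c = 3
```

The matching energy expressions (13), (15), (17) follow the same placement table but sum over predecessors instead of taking the maximum.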

2.5 Utility Constraints

Next, we derive the utility constraint of the edge server and the time budget for the completion time. The utility of the edge server is

U^f_p = Σ_{m=1}^{M} Σ_{n=0}^{N} ( P^f X^f_{m,n} − E^r_{m,n} X^c_{m,n} ),   (18)

where U^f_p is the utility of the edge server and P^f is the service price of the edge server.

3 Problem Formulation

In this section, we present the problem formulation with the completion time budget constraint and the utility constraint U^f_p. First, the completion time of all tasks on mobile device m can be defined as

TF_m = Σ_{n=0}^{N} ( X^m_{m,n} TF^l_{m,n} + X^f_{m,n} TF^f_{m,n} + X^c_{m,n} TF^c_{m,n} ),   (19)


where TF^l_{m,n} is the completion time of subtask n if it is executed locally, TF^f_{m,n} is its completion time if executed on the edge server, and TF^c_{m,n} is its completion time if executed on the remote cloud. The total energy consumption of one application, denoted E_m, is

E_m = Σ_{n=0}^{N} ( X^m_{m,n} EF^l_{m,n} + X^f_{m,n} EF^f_{m,n} + X^c_{m,n} EF^c_{m,n} ),   (20)

where EF^l_{m,n} is the energy consumption of subtask n if it is executed on the mobile device, EF^f_{m,n} is its energy cost if executed on the edge server, and EF^c_{m,n} is its energy cost if executed on the remote cloud.

In this work, the goal is to minimize the total energy consumption of tasks while meeting the completion time constraint; meanwhile, the utility U^f_p of the edge server is guaranteed. The energy consumption minimization problem can thus be defined as:

OPT-1  obj: min E_m
  C1: U^f_p > 0,
  C2: TF_m < Budget_m,
  C3: X^m_{m,n} ∈ {0, 1}, X^f_{m,n} ∈ {0, 1}, X^c_{m,n} ∈ {0, 1}, n ∈ [0, N], m ∈ [1, M],
  C4: X^m_{m,n} + X^f_{m,n} + X^c_{m,n} = 1, n ∈ [0, N], m ∈ [1, M].

Constraint C1 is the utility constraint, which guarantees a positive utility for the edge server. C2 is the task completion time budget, i.e., the delay constraint. C3 lists the binary constraints, and C4 is the unique placement constraint, meaning that each subtask can be executed at exactly one place.

Theorem 1. The sum task completion energy minimization problem for computation offloading in this study is NP-hard.

Proof. We transform the original problem depicted in OPT-1 and consider a special case in which the mobile device, edge server, and remote cloud server have the same configurations, resulting in the same energy costs and execution times. Regarding each subtask as an item with a value and a weight, the value corresponds to the execution time and the weight corresponds to the energy cost. We then ignore the task dependency constraints between subtasks as well as constraint C1; C2 can then be viewed as the knapsack's value constraint. The relaxed problem of OPT-1 thus becomes a knapsack problem [15], which is NP-hard. Therefore, the original problem OPT-1 is also NP-hard, which concludes the proof.
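To make the size of OPT-1's search space concrete, the sketch below enumerates all 3^N placements for a toy instance, the same space the paper's Brute Force baseline explores. It deliberately simplifies: task dependency is ignored (delays and energies just add up) and only the budget constraint C2 is checked; the numbers are invented for illustration.

```python
from itertools import product

# Simplified brute-force baseline over the 3^N placement space of OPT-1
# (assumed independent subtasks; not the paper's full model).

def brute_force(delays, energies, budget):
    """delays/energies: per-subtask dicts keyed 'local'/'edge'/'cloud'
    (constraint C4: exactly one site per subtask). Returns the minimum-
    energy placement whose total delay meets the budget (constraint C2)."""
    sites = ('local', 'edge', 'cloud')
    best, best_E = None, float('inf')
    for placement in product(sites, repeat=len(delays)):   # 3^N candidates
        T = sum(d[s] for d, s in zip(delays, placement))
        E = sum(e[s] for e, s in zip(energies, placement))
        if T < budget and E < best_E:                       # C2 + objective
            best, best_E = placement, E
    return best, best_E

delays = [{'local': 4, 'edge': 1, 'cloud': 2}] * 3
energies = [{'local': 1, 'edge': 3, 'cloud': 5}] * 3
best, E = brute_force(delays, energies, budget=7)
assert E == 7  # cheapest feasible mix keeps one subtask local
```

Even this toy version makes the exponential growth obvious, which is why the paper turns to a greedy Gain algorithm and simulated annealing.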

4 Algorithms

4.1 Gain Method

Based on the above models and analysis, we first design a greedy method named Gain to minimize the energy consumption of mobile device m when executing its tasks. To obtain the minimum energy cost over all subtasks of an application on mobile device m, the minimum-energy option for each subtask n is selected from EF^l_{m,n}, EF^f_{m,n}, and EF^c_{m,n}. This sub-procedure is shown in Lines 1 to 11 of Algorithm 1. We then iteratively adjust the initial offloading policy to satisfy the utility constraint U^f_p and the completion time budget Budget_m.

If the offloading policy does not satisfy the constraint on U^f_p, too many subtasks are executed on the remote cloud for the edge server to profit from serving mobile users. To satisfy the constraint, we must offload some subtasks from the remote cloud to the mobile device or to the edge server. The algorithm then chooses which subtask to move: to keep the energy cost minimal, we use the change in energy cost as the priority criterion; the smaller the change, the higher the priority. To satisfy the completion time budget, we compute the change in completion time and the change in energy cost of each offloading choice, and we pick the choice that decreases the completion time while guaranteeing the minimum change in energy cost.

Due to the utility constraint U^f_p, the choice of offloading site for a subtask must be made carefully. If subtask n is currently assigned to the mobile device, the only move is from the mobile device to the edge server. If subtask n is assigned to the edge server, the only move is from the edge server to the mobile device. If subtask n is assigned to the remote cloud, the move can be either from the remote cloud to the edge server or from the remote cloud to the mobile device. The details of the Gain algorithm are depicted in Algorithm 1.

Theorem 2. The time complexity of the Gain algorithm is O(N).

Proof. In Algorithm 1, the time complexity of the sub-procedure from Lines 1 to 12 is O(N), and the time complexity of the sub-procedure from Lines 14 to 31 is also O(N), because the number of adjustments cannot exceed N. Hence the time complexity of the Gain algorithm is O(N).

Theorem 3. The approximation ratio of the Gain algorithm is (1 + ε).

Proof. Due to limited space, the proof is omitted.
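The initialization phase described above (Lines 1-11 of Algorithm 1) can be sketched in a few lines. This is a hedged illustration of that phase only; the adjustment loop for U^f_p and Budget_m is omitted, and the EF values below are invented.

```python
# Sketch of Gain's initialization (Lines 1-11 of Algorithm 1): each subtask
# independently takes the site with minimum completion energy among
# EF^l, EF^f, EF^c. The feasibility-repair loop (Lines 13-32) is omitted.

def gain_init(EF):
    """EF: list of dicts {'local': EF_l, 'edge': EF_f, 'cloud': EF_c}.
    Returns one-hot placement flags per subtask, mirroring X^m, X^f, X^c."""
    policy = []
    for ef in EF:
        site = min(ef, key=ef.get)                 # cheapest site wins
        policy.append({'local': site == 'local',
                       'edge': site == 'edge',
                       'cloud': site == 'cloud'})
    return policy

EF = [{'local': 0.9, 'edge': 0.4, 'cloud': 0.7},
      {'local': 0.2, 'edge': 0.5, 'cloud': 0.6}]
pol = gain_init(EF)
assert pol[0]['edge'] and pol[1]['local']
```

The greedy start is what makes the later repair loop cheap: each repair step only ever moves one subtask, so at most N adjustments are needed, matching Theorem 2.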

5 5.1

Performance Evaluation Simulation Setup

To study the performance of proposed algorithms, We implement the algorithms on a high performance work station with an Intel I7 processor at frequency 3.9

470

X. Long et al.

Algorithm 1. Gain method for mobile device m
Input: tasks: a sequence of N subtasks of mobile device m, in execution order; W: the workload sizes of the subtasks; D: the data sizes of the subtasks; Budget_m: the completion time budget for the subtasks; pre: a 2-D array of each subtask's predecessors.
Output: X^m: the policy of subtasks executed locally on the mobile device; X^f: the policy of subtasks executed on the edge server; X^c: the policy of subtasks executed on the remote cloud.
1:  for n in tasks do
2:    compute EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n} by Equations (13), (15), (17)
3:    if min(EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n}) = EF^l_{m,n} then
4:      X^m_{m,n} ← 1, X^f_{m,n} ← 0, X^c_{m,n} ← 0
5:    end if
6:    if min(EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n}) = EF^f_{m,n} then
7:      X^m_{m,n} ← 0, X^f_{m,n} ← 1, X^c_{m,n} ← 0
8:    end if
9:    if min(EF^l_{m,n}, EF^f_{m,n}, EF^c_{m,n}) = EF^c_{m,n} then
10:     X^m_{m,n} ← 0, X^f_{m,n} ← 0, X^c_{m,n} ← 1
11:   end if
12: end for
13: compute U^f_p and TF_m
14: while U^f_p ≤ 0 or TF_m ≥ Budget_m do
15:   if U^f_p ≤ 0 then
16:     choose the subtask that brings about the minimum change in energy consumption when offloaded from the remote cloud to the edge server, or from the remote cloud to the mobile device
17:   end if
18:   if TF_m ≥ Budget_m then
19:     for n = 0 → N do
20:       if X^m_{m,n} = 1 then
21:         compute the change in energy cost when offloading the subtask from the mobile device to the edge server
22:       end if
23:       if X^f_{m,n} = 1 then
24:         compute the change in energy cost when offloading the subtask from the edge server to the mobile device
25:       end if
26:       if X^c_{m,n} = 1 then
27:         compute the change in energy cost when offloading the subtask from the remote cloud to the mobile device or from the remote cloud to the edge server
28:       end if
29:       choose the offloading policy with the minimum change in energy cost that decreases the completion time
30:     end for
31:   end if
32: end while


GHz and 8 GB of RAM. We use Python 3.6 [1] to simulate the offloading of subtasks and evaluate the algorithms in terms of running time, application completion time, and energy cost over 100 repeated trials. To simulate real-world tasks, we use a typical task graph, as shown in Fig. 2. In Fig. 2, dependency constraints exist between subtasks, which determine the execution order. Based on the task graph, one possible execution sequence for the subtasks is [0, 1, 2, 3, 4, 5, 6, 7].

Fig. 2. The task graph.
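A dependency-respecting execution order such as [0, 1, 2, 3, 4, 5, 6, 7] can be checked mechanically. The edge set below is assumed for illustration (the exact edges of Fig. 2 are in the figure, not the text):

```python
# Hypothetical encoding of a task graph like Fig. 2 (edges assumed):
# pre[n] lists the predecessors of subtask n.
pre = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4, 5], 7: [6]}

def valid_order(order, pre):
    """True iff every subtask appears after all of its predecessors."""
    seen = set()
    for n in order:
        if any(p not in seen for p in pre[n]):
            return False
        seen.add(n)
    return True

assert valid_order([0, 1, 2, 3, 4, 5, 6, 7], pre)
assert not valid_order([7, 0, 1, 2, 3, 4, 5, 6], pre)
```

Any topological order of the graph would do equally well as the simulation's execution sequence.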

We set 8 subtasks per application, with evenly distributed workloads and evenly distributed data sizes. The noise power between the edge server and the mobile device is set as σ^2 = 1; the wireless upload bandwidth is W = 2 Mbps and the wireless download bandwidth is W = 10 Mbps [10]. The upload bandwidth between the edge server and the remote cloud is W = 1024 Mbps, and the corresponding download bandwidth is W = 8192 Mbps [10]. The CPU frequency of the mobile device is f^m = 5 × 10^6 Hz, while the CPU frequency of the edge server is f^f = 2 × 10^9 Hz [18]. The CPU frequency of the remote cloud is set as f^c = 4 × 10^9 Hz [18]. The system parameters are α_f = 0.1, β_f = 0.1, α_c = 0.2, β_c = 0.2 [18]. The communication chip power of the mobile device is 0.1 watt [16], that of the edge server is 1 watt [16], and that of the remote cloud is 3 watt [16].

5.2 Simulation Results

Figure 3 shows the comparison of Gain, Brute Force, and SA in terms of running time under different workload sizes. From Fig. 3, we observe that the running time of Brute Force ranges from 7.54 s to 7.68 s, while the running time of Gain is less than 0.02 s. This is because Brute Force exhaustively searches all solutions, and the solution space of the problem is 3^N, where N denotes the number of subtasks. We can also observe that the running times of the three algorithms show almost no fluctuation, which indicates the robustness of the algorithms. For example, for Brute Force, the maximum running time is 7.66 s, while the minimum running time is 7.547 s; the difference between the maximum running time and


the minimum running time is only 0.12 s. In Gain, the maximum running time is 0.0015 s and the minimum running time is 0.001 s.

Fig. 3. Comparison of the three algorithms' running times under different workload sizes.

Fig. 4. The energy cost of Gain and Brute Force with the change of workload size.

Figure 4 shows the comparison of Gain and Brute Force in energy cost under different workload sizes. In Fig. 4, Brute Force always obtains the minimum energy cost. From the comparison, we observe that Gain achieves the same completion time budget performance as the optimal result with only 31% extra energy cost on average. The energy cost of Gain approximates the optimal result, especially when the workload sizes are 87.5 M and 262.5 M. In Fig. 4, when the workload size grows from 43.75 M to 87.5 M, the energy cost increases by 0.06 kJ, but it falls by 0.04 kJ when the workload size grows from 87.5 M to 131.25 M, due to the task dependency constraint. From Fig. 4, the curve of Gain's energy consumption follows almost the same shape as that of Brute Force.

Figure 5 shows the comparison of the application completion times of Gain and Brute Force. The completion time budget can be expressed as

Budget = 0.5 × Σ_{n=0}^{N} W_{m,n},   (21)

where W denotes the workload matrix and N denotes the number of subtasks of mobile device m.

From Fig. 5, we observe that the completion times of Gain and Brute Force are always lower than the completion time budget; therefore, both Gain and Brute Force obtain feasible solutions that satisfy the budget. When the workload size increases from 43.7 M to 206.25 M, the completion time of Gain increases from 1.05 s to 6.22 s, because the larger the workload, the longer Gain takes. In Fig. 6, we can see that the completion time of Gain occupies 40% to 80% of the completion time budget, while that of Brute Force, which is optimal, occupies 22% to 80%.

Energy-Eﬃcient Oﬄoading in Mobile Edge Computing

[Figure: two panels. (a) Application completion time vs. workload for Brute Force, Gain and Budget; (b) completion time as a percentage of Budget vs. workload for Brute Force and Gain.]

Fig. 5. The comparisons of application completion time of Gain, Brute Force and Budget with different workload sizes.

Fig. 6. The comparisons of application completion time as a percentage of Budget.

Conclusions

This paper has addressed novel computation offloading schemes with device, edge server and remote cloud collaboration. We have formulated the offloading problem as an energy cost minimization problem under an application completion time budget and an edge server profit constraint; the problem is NP-hard. We have designed the Gain algorithm to minimize the energy cost while respecting the constraints of completion time, utility and task dependency. Extensive simulations yield the following findings. First, in a three-tier structure of mobile device, edge server and remote cloud, the edge server plays a very important role in reducing the energy consumed during task execution. Second, the proposed greedy algorithm achieves the same application completion time performance as the Brute Force optimal algorithm with only 31% extra energy cost on average. In the future, we will devise online algorithms by modifying the initialization process of each algorithm and explore the energy cost minimization problem with a completion time constraint on each subtask.

Acknowledgment. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61702115 and 61672171, the Natural Science Foundation of Guangdong, China under Grant No. 2018B030311007, and the Major R&D Project of the Educational Commission of Guangdong under Grant No. 2016KZDXM052. This work was also supported by the China Postdoctoral Science Foundation under Grant No. 2017M622632. The corresponding author is Jigang Wu ([email protected]).



Quantitatively Investigating Multihop Localization Errors in Regular 2-D Sensor Networks

Bing Jia 1,2, Baoqi Huang 1,2 (B), Tao Zhou 1,2, and Wuyungerile Li 1,2

1 Inner Mongolia A.R. Key Laboratory of Wireless Networking and Mobile Computing, Hohhot 010021, China
[email protected]
2 College of Computer Science, Inner Mongolia University, Hohhot 010021, China

Abstract. In practice, a wireless sensor network normally includes a small portion of nodes with known locations, termed anchors, while the other nodes, with unknown locations and termed sensors, have to be localized through dedicated algorithms. Since not every sensor directly neighbors anchors, sensor locations are determined in a multi-hop localization manner, and the localization errors of sensors tend to rise with their minimal hop count to anchors, a phenomenon termed error propagation. Grasping the rule of error propagation is critical for designing and developing both localization algorithms and various applications. In this paper, we focus on quantitatively measuring how the localization errors vary across different sensors. To do so, regular 2-dimensional wireless sensor networks are considered, and formulae with respect to different sensors and different anchor placements are obtained. Simulations are conducted to validate these formulae and analyze the characteristics of error propagation.

Keywords: Localization errors · Wireless sensor network · Multihop · Error propagation

1 Introduction

In wireless sensor networks (WSNs), sensor locations are a key prerequisite for many applications and techniques, such as reporting the geographic origin of events, assisting in target tracking, and achieving geography-aware routing. Therefore, considerable effort has been invested in the development of localization systems [1–6]. Range-based sensor localization [7] is the problem of identifying the locations of sensor nodes, or simply sensors, given estimates of the

Supported by the National Natural Science Foundation of China under Grants 41761086, 41401519, 61461037, 61761035 and 61661041, the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No. 2016YFB0502102, the Natural Science Foundation of Inner Mongolia Autonomous Region of China under Grant 2017JQ09, and the Grassland Elite Project of the Inner Mongolia Autonomous Region under Grant CYYC5016.

© Springer Nature Switzerland AG 2018
J. Vaidya and J. Li (Eds.): ICA3PP 2018, LNCS 11336, pp. 476–488, 2018. https://doi.org/10.1007/978-3-030-05057-3_36


distances between them, known as range measurements. The basic range-based localization algorithms include trilateration, which in 2-dimensional (2-D) space employs at least three range measurements from non-collinear nodes at known locations, termed anchors, to localize a sensor. Due to noise in range measurements, only location estimates, as opposed to exact positions, can be derived. If not every sensor can measure its distances to sufficiently many anchors, already localized sensors must be used as pseudo-anchors to help their neighboring sensors become localized; this process is called multihop sensor localization. As a result, the localization errors of pseudo-anchors propagate into the localization results of later localized sensors, a phenomenon called error propagation. In the literature, the discussions in [8,9] have raised serious concerns about error propagation. In particular, error propagation in regular 1-dimensional WSNs was examined by obtaining the closed-form Cramér-Rao Lower Bound (CRLB) in [10,11], and key conclusions were drawn on how fast the error is propagated and how anchor placement affects error propagation. Moreover, error propagation in 2-D random WSNs was studied through a semi-quantitative approach in [9], in the sense that the influences of the sensor density and of the hop count from a sensor to anchors were investigated based on certain approximations. As such, it remains challenging to precisely measure how localization errors are propagated in 2-D WSNs. As a preliminary study, we shall investigate the phenomenon of error propagation through an exact formulation of localization errors in regular 2-D WSNs.
To be specific, we first discuss error propagation in a specific bilateration WSN, in which nodes with odd labels and nodes with even labels are regularly deployed at equal spacing on two parallel horizontal straight lines, and obtain formulae for the localization errors through a linearized Maximum Likelihood Estimator (MLE); then, we generalize this specific bilateration WSN to a new bilateration WSN by shifting one horizontal line as a whole in one horizontal direction, and extend the error formulae accordingly. These formulae accurately describe how the localization errors in the considered bilateration WSNs increase as the corresponding hop count to the anchors grows, and also demonstrate that different anchor placements result in dramatically different localization performance. Finally, we explore the error propagation of a regular 2-D sensor network consisting of multiple bilateration networks, and formulate the localization errors through the CRLB, which essentially amounts to inverting perturbed block Toeplitz matrices [12]. Simulation results are presented to validate the formulae obtained in this paper and illustrate the characteristics of error propagation in regular 2-D WSNs. The remainder of this paper is organized as follows. Section 2 establishes the problem model of a specific bilateration WSN (SBWSN) and gives the solution of the multihop localization errors. Section 3 presents the solution of the multihop localization errors in a general bilateration WSN (GBWSN) with both two and four anchors. Section 4 discusses the rate of error propagation of a regular 2-D WSN, and Sect. 5 concludes the paper.

2 The Specific Bilateration WSN

2.1 The SBWSN with Anchors on One Side

The SBWSN with anchors on one side is illustrated in Fig. 1. As can be seen, nodes with odd labels lie on the same horizontal straight line, and so do nodes with even labels; nodes 1 and 2 lie on the same vertical straight line, and so on. The left-most two nodes are anchors, all the others are sensors, and edges denote range measurements between two nodes. Define the following notation.

Fig. 1. A bilateration WSN with two anchors on one side.

– $n$ is obviously an even integer;
– nodes are labeled from 1 to $n$ in order;
– noise in range measurements is additive independent Gaussian with mean zero and standard deviation 1, denoted by $e_i$ ($0 < i < 2n+1$);
– the true location of node $i$ is $(x_i, y_i)$, and the uncertainty or error in node $i$'s estimated location is $U_i = (u_{x_i}\ u_{y_i})^T$, where the superscript $T$ denotes transposition; specifically, $U_1 = U_2 = 0$;
– the covariance matrix for the coordinates of nodes $2i+1$ and $2i+2$ ($0 < i < n/2$) is the $4 \times 4$ matrix
$$Q_i = \mathrm{Cov}\begin{pmatrix} U_{2i+1} \\ U_{2i+2} \end{pmatrix}.$$

At first, according to the structure of the SBWSN, we can define

– two $2 \times 2$ matrices $J_1$ and $J_2$:

$$J_1 = \begin{pmatrix} 1 & 0 \\ \cos\alpha & \sin\alpha \end{pmatrix}, \qquad J_2 = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ 1 & 0 \end{pmatrix}$$


– two $2 \times 4$ matrices $K_1$ and $K_2$:

$$K_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & \cos\alpha & \sin\alpha \end{pmatrix}, \qquad K_2 = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

Based on the linearized MLE adopted in [9], the localization errors of nodes $2i-1$ and $2i$ are

$$U_{2i-1} = (J_1^T J_1)^{-1} J_1^T \begin{pmatrix} e_{4i-3} \\ e_{4i-2} \end{pmatrix} + (J_1^T J_1)^{-1} J_1^T K_1 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix} \qquad (1)$$

$$U_{2i} = (J_2^T J_2)^{-1} J_2^T \begin{pmatrix} e_{4i-1} \\ e_{4i} \end{pmatrix} + (J_2^T J_2)^{-1} J_2^T K_2 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix} \qquad (2)$$

Furthermore, they can be simplified as

$$U_{2i-1} = J_1^{-1} \begin{pmatrix} e_{4i-3} \\ e_{4i-2} \end{pmatrix} + J_1^{-1} K_1 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix} \qquad (3)$$

$$U_{2i} = J_2^{-1} \begin{pmatrix} e_{4i-1} \\ e_{4i} \end{pmatrix} + J_2^{-1} K_2 \begin{pmatrix} U_{2i-3} \\ U_{2i-2} \end{pmatrix} \qquad (4)$$
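Equations (3)-(4) can be checked numerically. The following sketch is our own illustration, not from the paper (the angle $\alpha = 3\pi/8$ and the noise indexing are arbitrary choices): it simulates one propagation step from the anchors ($U_1 = U_2 = 0$) with unit-variance range noise and compares the empirical MSE with the theoretical value, the trace of $J_1^{-1} J_1^{-T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 3 * np.pi / 8                       # arbitrary choice of the angle
ca, sa = np.cos(alpha), np.sin(alpha)
J1 = np.array([[1.0, 0.0], [ca, sa]])
J2 = np.array([[ca, -sa], [1.0, 0.0]])
K1 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, ca, sa]])
K2 = np.array([[ca, -sa, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])

# One propagation step, Eqs. (3)-(4), for i = 2 (nodes 3 and 4): the previous
# nodes are the anchors, so (U_1; U_2) = 0 and only the range noise remains.
trials = 200_000
U_prev = np.zeros((4, trials))
e = rng.standard_normal((4, trials))        # unit-variance range noise
U3 = np.linalg.inv(J1) @ (e[:2] + K1 @ U_prev)
U4 = np.linalg.inv(J2) @ (e[2:] + K2 @ U_prev)

# With zero prior error, Cov(U_3) = J1^{-1} J1^{-T}; compare traces (MSEs).
mse_emp = np.mean(np.sum(U3 ** 2, axis=0))
mse_theory = np.trace(np.linalg.inv(J1) @ np.linalg.inv(J1).T)
assert abs(mse_emp - mse_theory) < 0.05
assert abs(np.mean(np.sum(U4 ** 2, axis=0)) - mse_theory) < 0.05
```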

Then, the covariance matrix is

$$Q_i = \begin{pmatrix} (J_1^T J_1)^{-1} & 0 \\ 0 & (J_2^T J_2)^{-1} \end{pmatrix} + \begin{pmatrix} J_1^{-1} K_1 \\ J_2^{-1} K_2 \end{pmatrix} Q_{i-1} \begin{pmatrix} J_1^{-1} K_1 \\ J_2^{-1} K_2 \end{pmatrix}^T \qquad (5)$$

$$= A A^T + B Q_{i-1} B^T \qquad (6)$$

$$= \sum_{j=0}^{i-1} B^j A A^T (B^j)^T \qquad (7)$$

$$= Q_{i-1} + B^{i-1} A A^T (B^{i-1})^T \qquad (8)$$

where $Q_0 = 0$, $B^0$ is an identity matrix, and

– $A = \begin{pmatrix} J_1^{-1} & 0 \\ 0 & J_2^{-1} \end{pmatrix}$
– $B = \begin{pmatrix} J_1^{-1} K_1 \\ J_2^{-1} K_2 \end{pmatrix}$


Regarding the matrices $A$ and $B$, we can obtain the following equations ($i \ge 0$):

$$B^{2i+1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -(2i+1)\cot\alpha & 0 & (2i+1)\cot\alpha & 1 \\ 0 & 0 & 1 & 0 \\ -(2i+1)\cot\alpha & 1 & (2i+1)\cot\alpha & 0 \end{pmatrix} \qquad (9)$$

$$B^{2i} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -2i\cot\alpha & 1 & 2i\cot\alpha & 0 \\ 0 & 0 & 1 & 0 \\ -2i\cot\alpha & 0 & 2i\cot\alpha & 1 \end{pmatrix} \qquad (10)$$

$$B^{2i+1} A A^T (B^{2i+1})^T = \begin{pmatrix} 1 & -(2i+1)\cot\alpha & 0 & -2(i+1)\cot\alpha \\ -(2i+1)\cot\alpha & (5+12i+8i^2)\cot^2\alpha + \csc^2\alpha & 2(i+1)\cot\alpha & 4(i+1)(2i+1)\cot^2\alpha \\ 0 & 2(i+1)\cot\alpha & 1 & (2i+1)\cot\alpha \\ -2(i+1)\cot\alpha & 4(i+1)(2i+1)\cot^2\alpha & (2i+1)\cot\alpha & (5+12i+8i^2)\cot^2\alpha + \csc^2\alpha \end{pmatrix} \qquad (11)$$

$$B^{2i} A A^T (B^{2i})^T = \begin{pmatrix} 1 & -(2i+1)\cot\alpha & 0 & -2i\cot\alpha \\ -(2i+1)\cot\alpha & (1+4i+8i^2)\cot^2\alpha + \csc^2\alpha & 2i\cot\alpha & 4i(2i+1)\cot^2\alpha \\ 0 & 2i\cot\alpha & 1 & (2i+1)\cot\alpha \\ -2i\cot\alpha & 4i(2i+1)\cot^2\alpha & (2i+1)\cot\alpha & (1+4i+8i^2)\cot^2\alpha + \csc^2\alpha \end{pmatrix} \qquad (12)$$

Since we are interested in the diagonal entries of $Q_i$, we only investigate the diagonal entries of the above resulting matrices. To distinguish the even and odd cases, we consider $Q_{2i}$ and $Q_{2i+1}$ respectively:

$$(Q_{2i})_{11} = (Q_{2i})_{33} = 2i, \qquad (Q_{2i})_{22} = (Q_{2i})_{44} = \left(\tfrac{16}{3} i^3 + \tfrac{2}{3} i\right)\cot^2\alpha + 2i \csc^2\alpha$$

$$(Q_{2i+1})_{11} = (Q_{2i+1})_{33} = 2i+1, \qquad (Q_{2i+1})_{22} = (Q_{2i+1})_{44} = \left(\tfrac{16}{3} i^3 + 8 i^2 + \tfrac{14}{3} i + 1\right)\cot^2\alpha + (2i+1)\csc^2\alpha$$

Using $\csc^2\alpha = 1 + \cot^2\alpha$, the formulae can be unified as

$$(Q_i)_{11} = (Q_i)_{33} = i, \qquad (Q_i)_{22} = (Q_i)_{44} = \left(\tfrac{2}{3} i^3 + \tfrac{4}{3} i\right)\cot^2\alpha + i$$

where $0 \le i < n/2$. As such, the Mean Squared Error (MSE) can be formulated as

$$MSE(U_i) = \left(\tfrac{2}{3} i^3 + \tfrac{4}{3} i\right)\cot^2\alpha + 2i \qquad (13)$$
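The recursion (8) and the closed form (13) can be cross-checked numerically; the sketch below (our own code, not from the paper) builds $A$ and $B$ from $J_1$, $J_2$, $K_1$, $K_2$, iterates the covariance recursion, and compares the resulting MSE with Eq. (13):

```python
import numpy as np

def mse_closed_form(i, alpha):
    # Eq. (13): MSE(U_i) = (2/3 i^3 + 4/3 i) cot^2(alpha) + 2 i
    cot2 = (1.0 / np.tan(alpha)) ** 2
    return (2/3 * i**3 + 4/3 * i) * cot2 + 2 * i

def mse_by_recursion(i, alpha):
    # Iterate Q_i = A A^T + B Q_{i-1} B^T (Eq. (6)), starting from Q_0 = 0,
    # and return the MSE of node 2i+1, i.e. (Q_i)_11 + (Q_i)_22.
    ca, sa = np.cos(alpha), np.sin(alpha)
    J1 = np.array([[1.0, 0.0], [ca, sa]])
    J2 = np.array([[ca, -sa], [1.0, 0.0]])
    K1 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, ca, sa]])
    K2 = np.array([[ca, -sa, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
    J1i, J2i = np.linalg.inv(J1), np.linalg.inv(J2)
    A = np.block([[J1i, np.zeros((2, 2))], [np.zeros((2, 2)), J2i]])
    B = np.vstack([J1i @ K1, J2i @ K2])
    Q = np.zeros((4, 4))
    for _ in range(i):
        Q = A @ A.T + B @ Q @ B.T
    return Q[0, 0] + Q[1, 1]

alpha = 3 * np.pi / 8
for i in range(1, 6):
    assert abs(mse_by_recursion(i, alpha) - mse_closed_form(i, alpha)) < 1e-6
```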


Evidently, the localization error measured by the MSE propagates at a rate of $\Theta(i^3)$, where $i$ denotes the hop count from a sensor to the anchors in this regular scenario.

2.2 Placing Anchors on Both Sides of the SBWSN

Suppose that another pair of anchors is placed at the right-most side of the bilateration WSN, as shown in Fig. 2. The localization procedure then becomes more complicated than in the aforementioned SBWSN with anchors only at the left-most side. A centralized localization algorithm would clearly be preferable, because all available information could be fully exploited, but a centralized implementation suffers from communication overhead and time delay, especially in large-scale wireless sensor networks. Therefore, we adopt a simple approach: we first independently perform two localization procedures, each initialized by the pair of anchors at one side of the bilateration network, and then fuse the two location estimates at each sensor through a weighted average to produce the final location estimate.

Fig. 2. A SBWSN with four anchors.

Specifically, given node $i$ with $i$ odd, its location estimate from the localization procedure initialized by the left-most anchors is $U_i$, and evidently, the location estimate by the right-most anchors equals $U_{n-i}$ ($U_{n+2-i}$ if $i$ is even). The final location estimate can then be formulated as

$$V_i = w_i U_i + (1 - w_i) U_{n-i} \qquad (14)$$

where $w_i$ is the weight. The MSE is then

$$MSE(V_i) = w_i^2\, MSE(U_i) + (1 - w_i)^2\, MSE(U_{n-i}) \qquad (15)$$

It is noticeable that different weights result in dramatically different error characteristics. To fuse the two location estimates efficiently, the more accurate a candidate location estimate is, the larger its weight should be. Therefore, we let $w_i = MSE(U_{n-i}) / (MSE(U_i) + MSE(U_{n-i}))$, and then

$$MSE(V_i) = \frac{MSE(U_i)\, MSE(U_{n-i})}{MSE(U_i) + MSE(U_{n-i})} = \frac{MSE(U_i)}{1 + \frac{MSE(U_i)}{MSE(U_{n-i})}} \qquad (16)$$
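The fusion rule of Eqs. (14)-(16) amounts to inverse-MSE weighting; a minimal sketch (the function name `fuse` is our own) is:

```python
import numpy as np

def fuse(u_left, u_right, mse_left, mse_right):
    """Inverse-MSE weighted fusion of two location estimates, Eqs. (14)-(16)."""
    w = mse_right / (mse_left + mse_right)      # weight of the left estimate
    v = w * np.asarray(u_left) + (1 - w) * np.asarray(u_right)
    mse_fused = mse_left * mse_right / (mse_left + mse_right)  # Eq. (16)
    return v, mse_fused

# The fused MSE never exceeds the smaller of the two input MSEs.
_, m = fuse([1.0, 2.0], [1.2, 1.9], mse_left=4.0, mse_right=1.0)
assert m <= min(4.0, 1.0)  # m = 0.8
```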

Obviously, the MSE of the fused location estimate is smaller than that of either separate location estimate. Consequently, the speed of error propagation decreases as well.

2.3 Simulation Results

We conduct simulations to validate the results on error propagation with respect to different hop counts, using two anchors and four anchors in the bilateration WSN respectively, with $\alpha$ being one of $5\pi/16$, $3\pi/8$ and $\pi/4$. The simulation results are plotted in Fig. 3. As can be seen, the angle $\alpha$ has a serious impact on error propagation in both cases: the smaller the angle, the faster the error propagates. Moreover, placing anchors on both sides dramatically reduces localization errors compared with placing anchors on one side.

Fig. 3. The MSE with respect to diﬀerent hops in the SBWSN.

3 A General Bilateration WSN

3.1 Placing Anchors on One Side of the GBWSN

More generally, we generalize the bilateration WSN by allowing two distinct angles $\alpha$ and $\beta$, yielding a GBWSN, which is illustrated in Fig. 4. The left-most two nodes are anchors, all the others are sensors, and edges denote range measurements between two nodes. In the network, nodes with odd labels lie on the same horizontal straight line, and so do nodes with even labels; nodes 1 and 2 are not required to lie on the same vertical straight line, and so forth, owing to the different angles $\alpha$ and $\beta$.

Fig. 4. A GBWSN with two anchors.

The localization errors can be formulated as

$$(Q_i)_{11} = (Q_i)_{33} = i,$$
$$(Q_i)_{22} + (Q_i)_{44} = \left(\tfrac{1}{3} i^3 - \tfrac{1}{3} i\right)(\cot\alpha + \cot\beta)^2 + 2i\left(1 + \cot^2\alpha + \cot^2\beta\right).$$
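As a sanity check (our own, not in the paper), setting $\beta = \alpha$ should reduce the GBWSN sum $(Q_i)_{22} + (Q_i)_{44}$ to twice the SBWSN entry $(Q_i)_{22}$ from Sect. 2.1; a quick numerical check:

```python
import math

def gbwsn_sum(i, a, b):
    # (Q_i)_22 + (Q_i)_44 for the GBWSN (Sect. 3.1)
    cot = lambda t: 1.0 / math.tan(t)
    return (i**3 / 3 - i / 3) * (cot(a) + cot(b))**2 \
        + 2 * i * (1 + cot(a)**2 + cot(b)**2)

def sbwsn_sum(i, a):
    # Twice the SBWSN entry (Q_i)_22 = (2/3 i^3 + 4/3 i) cot^2(a) + i
    cot2 = (1.0 / math.tan(a)) ** 2
    return 2 * ((2/3 * i**3 + 4/3 * i) * cot2 + i)

for i in range(1, 8):
    assert abs(gbwsn_sum(i, 0.9, 0.9) - sbwsn_sum(i, 0.9)) < 1e-8
```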

3.2 Simulation Results

We conduct an experiment to analyze the localization error with respect to different hop counts, using two anchors and four anchors in the general bilateration network respectively, when $\alpha$ and $\beta$ are set to $(5\pi/16, \pi/4)$, $(3\pi/8, \pi/4)$, and $(3\pi/8, 5\pi/16)$. The result is shown in Fig. 5.

Fig. 5. The MSE with respect to diﬀerent hops in the GBWSN.

4 A Regular 2-D WSN Consisting of Multiple Bilateration WSNs

4.1 The Problem Model

Suppose a regular 2-D network as illustrated in Fig. 6, in which $mn$ nodes are placed at its $mn$ grid points, and each edge between a pair of nodes denotes a range measurement with independent additive Gaussian noise $N(0, \sigma^2)$.

Fig. 6. A grid.

4.2 The Rate of Error Propagation

The Fisher Information Matrix (FIM) of this sensor network, denoted $J$, is a $2mn \times 2mn$ square matrix and can be formulated as

$$J = \frac{1}{\sigma^2} \begin{pmatrix} A' & B & & & 0 \\ B & A & B & & \\ & B & A & \ddots & \\ & & \ddots & \ddots & B \\ 0 & & & B & A' \end{pmatrix}$$


where $A$, $A'$ and $B$ are $2m \times 2m$ matrices:

$$A = \begin{pmatrix}
1 & 0 & -1 & 0 & & & & & \\
0 & 2 & 0 & 0 & & & & & \\
-1 & 0 & 2 & 0 & -1 & 0 & & & \\
0 & 0 & 0 & 2 & 0 & 0 & & & \\
& & -1 & 0 & 2 & 0 & & & \\
& & 0 & 0 & 0 & 2 & & & \\
& & & & & & \ddots & & \\
& & & & & 2 & 0 & -1 & 0 \\
& & & & & 0 & 2 & 0 & 0 \\
& & & & & -1 & 0 & 1 & 0 \\
& & & & & 0 & 0 & 0 & 2
\end{pmatrix}$$

$$A' = \begin{pmatrix}
1 & 0 & -1 & 0 & & & & & \\
0 & 1 & 0 & 0 & & & & & \\
-1 & 0 & 2 & 0 & -1 & 0 & & & \\
0 & 0 & 0 & 1 & 0 & 0 & & & \\
& & -1 & 0 & 2 & 0 & & & \\
& & 0 & 0 & 0 & 1 & & & \\
& & & & & & \ddots & & \\
& & & & & 2 & 0 & -1 & 0 \\
& & & & & 0 & 1 & 0 & 0 \\
& & & & & -1 & 0 & 1 & 0 \\
& & & & & 0 & 0 & 0 & 1
\end{pmatrix}$$

$$B = \mathrm{diag}(0, -1, 0, -1, \ldots, 0, -1)$$

Obviously, $J$ is an $n \times n$ symmetric tridiagonal block matrix with block size $2m \times 2m$; each block is associated with one row of the network. Moreover, $A$ and $A'$ are themselves symmetric tridiagonal block matrices with block size $2 \times 2$, while $B$ is a diagonal matrix. In $J$, each node corresponds to two adjacent rows and columns. To designate some nodes as anchors, we simply eliminate the rows and columns associated with those nodes. For example, if all nodes in the bottom and top rows are anchors, we obtain:


$$J' = \frac{1}{\sigma^2} \begin{pmatrix} A & B^T & & & 0 \\ B & A & B^T & & \\ & B & A & \ddots & \\ & & \ddots & \ddots & B^T \\ 0 & & & B & A \end{pmatrix}$$

4.3 Simulation Results

Since $J$ is a tridiagonal block Toeplitz matrix, a number of relevant results are available in the literature, e.g. [12,13]. To obtain the inverse of $J$, we first need to invert $A$, because $A^{-1}$ is used repeatedly in inverting $J$. Owing to the simple structure of $A$, computing $A^{-1}$ is not hard; however, computing $J^{-1}$ is a difficult problem. For simplicity, we let $p = m - 2$ and $q = n - 2$, and moreover let $\sigma = 1$. Figure 7 shows the inverse of $J$ (i.e., the CRLB) for the case $p = 10$ and $q = 10$.

Fig. 7. The CRLB for a 10 × 10 regular 2-D WSN.
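The CRLB for such a grid can also be computed directly from the network geometry rather than via the block matrices above. The sketch below is our own construction under assumptions the paper does not state explicitly: unit grid spacing, a hypothetical connectivity radius of 1.5 (so both axis-aligned and diagonal neighbors are ranged, which keeps the FIM nonsingular), the top and bottom rows taken as anchors, and a smaller 6 × 6 grid than the paper's 10 × 10 interior; the name `crlb_grid` is ours:

```python
import numpy as np
from itertools import product

def crlb_grid(m, n, radius=1.5, sigma=1.0):
    # Standard FIM for Gaussian range noise: each edge (i, j) contributes
    # u u^T / sigma^2 on both endpoints and -u u^T / sigma^2 across them,
    # where u is the unit direction vector from node i to node j.
    pts = np.array([(x, y) for y, x in product(range(n), range(m))], float)
    anchors = set(range(m)) | set(range(m * (n - 1), m * n))  # bottom, top rows
    sensors = [i for i in range(m * n) if i not in anchors]
    idx = {s: k for k, s in enumerate(sensors)}
    J = np.zeros((2 * len(sensors), 2 * len(sensors)))
    for i in range(m * n):
        for j in range(i + 1, m * n):
            d = pts[j] - pts[i]
            r = np.linalg.norm(d)
            if r > radius or (i in anchors and j in anchors):
                continue
            blk = np.outer(d / r, d / r) / sigma ** 2
            if i not in anchors:
                a = 2 * idx[i]
                J[a:a + 2, a:a + 2] += blk
            if j not in anchors:
                b = 2 * idx[j]
                J[b:b + 2, b:b + 2] += blk
            if i not in anchors and j not in anchors:
                J[a:a + 2, b:b + 2] -= blk
                J[b:b + 2, a:a + 2] -= blk
    return np.linalg.inv(J)  # the CRLB

C = crlb_grid(6, 6)
mse = np.array([C[2*k, 2*k] + C[2*k+1, 2*k+1] for k in range(C.shape[0] // 2)])
assert np.all(mse > 0)
```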


As can be seen from Fig. 7(a), for each column index the CRLB exhibits the same tendency across different row indices, and similarly, Fig. 7(b) shows that for each row index the CRLB exhibits the same tendency across different column indices. Therefore, a subgraph with n = 2 (or m = 2), i.e., a bilateration network, can largely reflect the error propagation in a regular 2-D WSN.

5 Conclusion

Since sensor locations in a WSN are generally determined in a multi-hop localization manner, in this paper we have investigated the error propagation problem in regular 2-D WSNs by quantitatively measuring how the localization errors vary across different sensors. Furthermore, formulae with respect to different angles, numbers of hops and diﬀ