Session Open

Conference Opening

Conference
9:00 AM — 9:10 AM HKT
Local
Dec 1 Tue, 8:00 PM — 8:10 PM EST

Session Chair

Song Guo (The Hong Kong Polytechnic University)

Session Keynote-1

Keynote 1

Conference
9:10 AM — 9:55 AM HKT
Local
Dec 1 Tue, 8:10 PM — 8:55 PM EST

Resource Allocation and Consensus in Edge Blockchain Systems

Yuanyuan Yang (SUNY Distinguished Professor, Stony Brook University, USA)

Edge devices with sensing, storage, and communication resources are penetrating our daily lives. Such edge devices can be used to conduct data transactions, e.g., micro-payments and micro-access control, and blockchain technology can ensure that these transactions are tamper-proof and non-repudiable. In this talk, we present a blockchain system adapted to edge devices. The new blockchain system can fairly and efficiently allocate storage resources on edge devices and achieve high scalability. We find the optimal peer nodes for transaction data storage in the blockchain, and provide the Recent Block Storage Allocation Scheme for quick retrieval of missing blocks. This blockchain system can also reach mining consensus with low energy consumption on edge devices via a new Proof of Stake mechanism. Extensive simulations show that the blockchain system works efficiently in edge environments: on average, the new system uses 15% less time and consumes 64% less battery power compared with traditional blockchain systems.

Session Chair

Song Guo (The Hong Kong Polytechnic University)

Session Keynote-2

Keynote 2

Conference
9:55 AM — 10:40 AM HKT
Local
Dec 1 Tue, 8:55 PM — 9:40 PM EST

On Optimal Partitioning and Scheduling of DNNs in Mobile Cloud Computing

Jie Wu (Center for Networked Computing, Temple University, USA)

As Deep Neural Networks (DNNs) are widely used in various applications, including computer vision tasks such as image segmentation and recognition, it is important to reduce the makespan of DNN computation, especially when running on mobile devices. Offloading is a viable solution that moves computation from a slow mobile device to a fast but remote server in the cloud. Since DNN computation is a multi-stage processing pipeline, it is critical to decide at which stage offloading should occur to minimize the makespan. Our observations show that the local computation time on a mobile device increases linearly, while the offloading time decreases monotonically along a convex curve, as more DNN layers are computed on the mobile device. Based on this observation, we first study optimal partitioning and scheduling for a single line-structure DNN, then extend the result to multiple line-structure DNNs. Heuristic results for general-structure DNNs, represented by Directed Acyclic Graphs (DAGs), are also discussed based on a path-based scheduling policy. Our proposed solutions are validated via a real system implementation.
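The partitioning observation above (linearly growing local cost vs. convexly decreasing offload cost) can be illustrated with a minimal sketch. This is not the paper's algorithm; the function name and the profile numbers are hypothetical stand-ins:

```python
# Illustrative sketch of one-line-DNN partitioning: local time grows roughly
# linearly with the number of layers kept on the device, while the remaining
# offload time decreases convexly; we pick the split minimizing the makespan.

def best_split(local_times, offload_times):
    """local_times[k]: cumulative on-device time if the first k layers run locally.
    offload_times[k]: remaining transfer + server time when offloading after layer k.
    Returns the split index k minimizing the total makespan."""
    assert len(local_times) == len(offload_times)
    makespans = [l + o for l, o in zip(local_times, offload_times)]
    return min(range(len(makespans)), key=makespans.__getitem__)

# Hypothetical per-split profile (ms): linear local cost, convex offload cost.
local = [0, 2, 4, 6, 8, 10]
offload = [20, 12, 7, 5, 4.5, 4.4]
k = best_split(local, offload)  # -> 2: offload after the second layer
```

For DAG-structured DNNs the paper resorts to path-based heuristics, since a single split index no longer suffices.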

Session Chair

Jiannong Cao (The Hong Kong Polytechnic University)

Session Break-1

Virtual Break

Conference
10:40 AM — 10:50 AM HKT
Local
Dec 1 Tue, 9:40 PM — 9:50 PM EST

Session Chair

N/A

Session A1

Learning Algorithm

Conference
10:50 AM — 11:50 AM HKT
Local
Dec 1 Tue, 9:50 PM — 10:50 PM EST

LTG-LSM: The Optimal Structure in LSM-tree Combined with Reading Hotness

Jiaping Yu, Huahui Chen, Jiangbo Qian and Yihong Dong

A growing number of KV storage systems have adopted the Log-Structured Merge-tree (LSM-tree) due to its excellent write performance, but the LSM-tree's high write amplification has always been a difficult problem to solve. The reason is that the traditional LSM-tree design under-utilizes the data distribution of queries, and its design space does not consider read and write performance concurrently; as a result, one may be sacrificed to improve the other. When advancing the write performance of the LSM-tree, we can only conservatively select design patterns that limit the impact on read throughput, resulting in limited improvement. To address these shortcomings, we propose a new LSM-tree structure, the Leveling-Tiering-Grouped-LSM-tree (LTG-LSM), which incorporates read hotness. LTG-LSM maintains hotness prediction models at each level of the LSM-tree, and the structure of each newly generated disk component is determined by its predicted hotness. Finally, a specific compaction algorithm handles compactions between components of different structures and processes workload hotness changes. Experiments show that the proposed scheme significantly reduces the write amplification of the original LSM-tree (by up to about 71%) with almost no sacrifice in read performance, and improves write throughput (by up to about 24%) on workloads with different configurations.

Improved MapReduce Load Balancing through Distribution-Dependent Hash Function Optimization

Zafar Ahmad, Sharmila Duppala, Rezaul Chowdhury and Steven Skiena

Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers, thereby reducing the overall execution time. In this paper, we propose a lightweight optimization approach for MapReduce systems to minimize the makespan of repetitive tasks involving a typical frequency distribution.
Our idea is to analyze the observed frequency distribution of the given task so as to identify an optimal offset parameter c to add in the hash function to minimize the makespan. For two different bucketing methods – modulo labeling and consecutive binning – we present efficient algorithms for finding the optimal value of c. Finally, we present simulation results for both bucketing methods. The results vary with the data distribution and the number of reducers, but generally reduce the makespan by 20% on average for power-law distributions. The results are confirmed with experiments on well-known real-world data sets.
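The core idea can be sketched as a search over candidate offsets, scoring each by the resulting worst-loaded reducer. This is a simplification, not the paper's algorithms: here c is treated as a salt mixed into the hash (so each c induces a different key-to-reducer partition), and the keys, frequencies, and hash choice are hypothetical:

```python
# Illustrative brute-force search for an offset c that minimizes the makespan
# (heaviest reducer load) under an observed frequency distribution.
import hashlib

def bucket(key, c, num_reducers):
    """Hash the key salted with offset c, then map to a reducer."""
    h = hashlib.md5(f"{key}:{c}".encode()).digest()
    return int.from_bytes(h[:4], "big") % num_reducers

def optimal_offset(freq, num_reducers, candidates):
    """freq: key -> observed frequency. Returns the candidate offset whose
    induced partition has the smallest maximum reducer load."""
    def makespan(c):
        loads = [0] * num_reducers
        for key, f in freq.items():
            loads[bucket(key, c, num_reducers)] += f
        return max(loads)
    return min(candidates, key=makespan)
```

On a skewed (e.g., power-law) distribution, different offsets spread the heavy keys differently, which is exactly the slack the search exploits.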

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Shaohuai Shi, Qiang Wang and Xiaowen Chu

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas, such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage data structures let sparse matrices be stored in a memory-saving format, but they make it difficult to optimize SpDM performance on modern GPUs due to the irregular data accesses of the sparse structure, which result in lower resource utilization and poorer performance. In this paper, we use the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global memory access, fast shared-memory reuse, and more operations per byte of global memory traffic. Experiments are evaluated on three Nvidia GPUs (GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves 1.5–8× speedup over Nvidia's cuSPARSE library on many matrices.
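For readers unfamiliar with the kernel being optimized, here is a minimal COO-style SpDM in plain Python. It only illustrates the operation itself; GCOOSpDM's grouped-COO layout and GPU-specific optimizations (coalescing, shared-memory reuse) are far beyond this sketch:

```python
# Minimal sparse-dense multiply C = A_sparse @ B_dense, with A stored in
# COO format as three parallel arrays (row indices, column indices, values).

def spdm_coo(rows, cols, vals, B, m):
    """A has shape m x k in COO form; B is a dense k x n list-of-lists.
    Returns the dense m x n product."""
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for r, c, v in zip(rows, cols, vals):
        # Each nonzero A[r][c] contributes v * B[c][:] to row r of C.
        Brow, Crow = B[c], C[r]
        for j in range(n):
            Crow[j] += v * Brow[j]
    return C
```

The irregular access pattern is visible even here: `B[c]` jumps around memory as the nonzeros dictate, which is precisely what hurts utilization on GPUs.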

Session Chair

Gongpu Chen (Chinese University of Hong Kong)

Session A2

Algorithms for Cloud Systems

Conference
10:50 AM — 12:10 PM HKT
Local
Dec 1 Tue, 9:50 PM — 11:10 PM EST

Joint Service Placement and Request Scheduling for Multi-SP Mobile Edge Computing Network

Zhengwei Lei, Hongli Xu, Liusheng Huang and Zeyu Meng

Mobile edge computing (MEC), an emerging computing paradigm, pushes services away from the centralized remote cloud to distributed edge servers deployed by multiple service providers (SPs), improving user experience and reducing the communication burden on the core network. However, this distributed computing architecture also brings new challenges. In a multi-SP MEC system, an SP prefers to use the edge servers it deployed itself rather than others', which both improves service quality and reduces processing cost. The service placement and request scheduling strategies directly affect the revenue of SPs. Since service popularity changes over time and the resources of edge servers are limited, the system must make service placement and request scheduling decisions dynamically to serve users better. Owing to the lack of long-term prior knowledge and the binary decision variables involved, placing services and scheduling requests to boost SP profit is a challenging problem. We formally formulate this joint optimization problem and propose an efficient online algorithm: we first invoke Lyapunov optimization to convert the long-term optimization problem into a series of subproblems, and then use a dual-decomposition algorithm to solve each subproblem. Experimental results show that the proposed algorithm achieves nearly optimal performance, raising profit by 25% and 70% compared with greedy and Top-K algorithms, respectively.
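The Lyapunov step the abstract mentions follows a standard drift-plus-penalty pattern, sketched generically below. This is the textbook template, not the paper's actual subproblem or dual decomposition; all names here are hypothetical:

```python
# Generic drift-plus-penalty slot: a virtual queue Q tracks accumulated
# violation of a long-term constraint, and each slot we pick the action
# minimizing V*cost(a) + Q*violation(a), trading cost against stability.

def lyapunov_step(Q, actions, cost, violation, V):
    """Pick the per-slot action minimizing the drift-plus-penalty bound,
    then update the virtual queue. Larger V weights cost over stability."""
    best = min(actions, key=lambda a: V * cost(a) + Q * violation(a))
    Q_next = max(Q + violation(best), 0.0)  # virtual-queue update
    return best, Q_next

# Toy two-action example: action 1 is cheap but violates the constraint,
# action 0 is costly but drains the queue.
cost = lambda a: [5.0, 1.0][a]
viol = lambda a: [-1.0, 2.0][a]
a, Q = lyapunov_step(0.0, [0, 1], cost, viol, V=1.0)  # picks 1, Q becomes 2.0
```

As Q grows, the constraint term dominates and the costly-but-compliant action is chosen, which is how the long-term constraint gets enforced on average.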

A similarity clustering-based deduplication strategy in cloud storage systems

Saiqin Long, Zhetao Li, Zihao Liu, Qingyong Deng, Sangyoon Oh and Nobuyoshi Komuro

Deduplication is a data-redundancy elimination technique designed to save storage resources by reducing redundant data in cloud storage systems. With the development of cloud computing technology, deduplication has been increasingly applied in cloud data centers. However, traditional techniques face great challenges with big data in properly balancing two conflicting goals: deduplication throughput and a high duplicate elimination ratio. This paper proposes a similarity clustering-based deduplication strategy (SCDS), which aims to delete more duplicate data without significantly increasing system overhead. The main idea of SCDS is to narrow the query range of the fingerprint index using data partitioning and similarity clustering algorithms. In the data preprocessing stage, SCDS uses a data partitioning algorithm to group similar data together. In the data deletion stage, a similarity clustering algorithm divides similar fingerprint superblocks into the same cluster, and repetitive fingerprints are detected within each cluster to speed up the retrieval of duplicate fingerprints. Experiments show that the deduplication ratio of SCDS is better than that of some existing similarity-based deduplication algorithms, while its overhead is only slightly higher than that of high-throughput but low-deduplication-ratio methods.

A Fair Task Assignment Strategy for Minimizing Cost in Mobile Crowdsensing

Yujun Liu, Yongjian Yang, En Wang, Wenbin Liu, Dongming Luan, Xiaoying Sun and Jie Wu

Mobile CrowdSensing (MCS) is a promising paradigm that recruits mobile users to cooperatively perform various sensing tasks. When assigning tasks to users, most existing works only consider the fairness of users, i.e., each user's processing ability, with the goal of minimizing the assignment cost. In this paper, we argue that it is necessary not only to make full use of all users' ability to process tasks (e.g., not exceeding the maximum capacity of each user while also not letting any user idle too long), but also to satisfy the assignment frequency of all tasks (e.g., how many times each task should be assigned within the whole system time) to ensure a long-term, doubly fair, and stable participatory sensing system. Hence, to solve the task assignment problem of reasonably assigning tasks to users with limited processing ability while ensuring the assignment frequency, we first model the two fairness constraints simultaneously by converting them into user processing queues and task virtual queues, respectively. Then, we propose a Fair Task Assignment Strategy (FTAS) based on Lyapunov optimization, and we prove the optimality of the proposed strategy by showing an upper bound on the total assignment cost and queue backlog. Finally, extensive simulations have been conducted over three real-life mobility traces: Changchun/taxi, Epfl/mobility, and Feeder. The simulation results show that the proposed strategy achieves a trade-off between minimizing cost and the fairness of tasks and users compared with other baseline approaches.

Communication-Aware Load Balancing of the LU Factorization over Heterogeneous Clusters

Lucas Leandro Nesi, Lucas Mello Schnorr and Arnaud Legrand

Supercomputers are designed to be as homogeneous as possible, but it is common for a few nodes to exhibit variable performance capabilities due to processor manufacturing, and for partitions to be equipped with different types of accelerators. Data distribution over heterogeneous nodes is very challenging but essential to exploit all resources efficiently. In this article, we build upon task-based runtimes' flexibility in managing data to study the interplay between static communication-aware data distribution strategies and dynamic scheduling of the linear-algebra LU factorization over heterogeneous sets of hybrid nodes. We propose two techniques derived from state-of-the-art 1D×1D data distributions: first, using fewer computing nodes toward the end to better match performance bounds and save computing power; second, carefully moving a few blocks between nodes to further optimize load balancing. We also demonstrate how 1D×1D data distributions, tailored for heterogeneous nodes, can scale better on homogeneous clusters than classical block-cyclic distributions. Validation is carried out in both real and simulated environments on homogeneous and heterogeneous platforms, demonstrating compelling performance improvements.

Session Chair

Lei Mo (Southeast University)

Session A3

Algorithms for Networks

Conference
10:50 AM — 12:10 PM HKT
Local
Dec 1 Tue, 9:50 PM — 11:10 PM EST

An Efficient Work-Stealing Scheduler for Task Dependency Graph

Chun-Xun Lin, Tsung-Wei Huang and Martin D. F. Wong

Work-stealing is a key component of many parallel task-graph libraries such as Intel Threading Building Blocks (TBB) FlowGraph, Microsoft Task Parallel Library (TPL) Batch.Net, Cpp-Taskflow, and Nabbit. However, designing a correct and effective work-stealing scheduler is notoriously difficult due to the subtle implementation details of concurrency control and decentralized coordination between threads. The problem becomes even more challenging when striving for optimal thread usage on parallel workloads with complex task graphs. In this paper, we introduce an effective work-stealing scheduler for executing task dependency graphs. Our scheduler adopts a simple and efficient strategy that adapts the number of working threads to the available task parallelism at any time during graph execution. The strategy is provably good at preventing resource under-utilization while minimizing resource waste when tasks are scarce. We have evaluated our scheduler on both micro-benchmarks and a real-world circuit timing analysis workload, and demonstrate promising results over existing methods in terms of runtime, energy efficiency, and throughput.

LBNN: Perceiving the State Changes of a Core Telecommunications Network via Linear Bayesian Neural Network

Yanying Lin, Kejiang Ye, Ming Chen, Naitian Deng, Tailin Wu, and Cheng-Zhong Xu

The core network is the most basic facility in the entire telecommunications network, consisting of a large number of routers, switches, and firewalls. Network management, such as re-planning routes or adjusting policies, is very important for avoiding failures, but the timing of intervention is challenging: intervening too early incurs unnecessary overhead, while intervening too late can cause serious disasters. In this paper, we analyze a large dataset from a real-world core telecommunications network and propose Linear Bayesian Neural Networks (LBNN) to perceive core network state changes and make decisions about network intervention. In particular, we consider three sources of complexity: the weight of the mutual effect between devices, the dependence of network states on the time dimension, and the randomness of network state changes. The entire model is extended to a probabilistic model based on the Bayesian framework to better capture the randomness and variability of the data. Experimental results on a real-world dataset show that LBNN achieves very high detection accuracy, averaging 92.1%.

A Method to Detecting Artifact Anomalies in A Microservice Architecture

Faisal Fahmi, Pei-Shu Huang and Feng-Jian Wang

The microservice architecture is a Service-Oriented Architecture (SOA) in which a service-based application is composed of a number of smaller but independently and concurrently running units, called microservices, to improve the performance and maintainability of the application. In such an application, an unexpected artifact state inside one microservice may be passed to another microservice or other service units and corrupt the whole application. Abnormal artifact operation pairs can be categorized into continuous and concurrent artifact anomalies, which indicate that two sequential or parallel operations working on the same artifact result in semantically abnormal behavior. Recent studies showed that an SP-tree structure adopted for detecting both kinds of anomalies inside a structured workflow can reduce the computational complexity of detection to linear. In this paper, we present a series of SP-tree-based methods to detect the artifact anomalies inside each microservice of an application during the design phase. Unlike the design of applications with traditional services, where each service is assumed to contain all possible types of anomalies and cannot be modified directly, the designer of a microservice works within a limited scope and can modify each microservice based on the anomaly detection results. Our contributions include identifying the artifact properties in a microservice architecture and methods to detect anomalies based on these properties, which can simplify the detection of artifact anomalies in service-based applications.

Contention resolution on a restrained channel

Elijah Hradovich, Marek Klonowski and Dariusz R. Kowalski

We examine deterministic contention resolution on a multiple-access channel when packets are injected continuously by an adversary into the buffers of the n available stations in the system, arbitrarily, subject to a bound on the number of packets injected per round. The aim is to successfully transmit packets and maintain system stability, that is, bounded queues, even in infinite executions. The largest injection rate for which a given contention resolution algorithm guarantees stability is called the algorithm's throughput. In contrast to previous work, we consider a channel in which there is a strict limit k on the total number of stations allowed to transmit or listen to the channel at any given time, which can never be exceeded; we call such a channel a k-restrained channel.
We construct adaptive and full-sensing protocols with optimal throughput 1 and almost optimal throughput 1 − 1/n, respectively, on a constant-restrained channel. By contrast, we show that restricted protocols based on schedules known in advance obtain throughput at most min.
We also support our theoretical analysis with simulation results for our algorithms in systems of moderate, realistic sizes and scenarios, and compare them with popular backoff protocols.

Session Chair

Fei Tong (Southeast University)

Session Lunch-Break-1

Virtual Lunch Break

Conference
12:10 PM — 1:30 PM HKT
Local
Dec 1 Tue, 11:10 PM — 12:30 AM EST

Session Chair

N/A

Session B1

Performance Prediction and Optimization

Conference
1:30 PM — 2:30 PM HKT
Local
Dec 2 Wed, 12:30 AM — 1:30 AM EST

Real-Time Scheduling and Analysis of OpenMP Programs with Spin Locks

He Du, Xu Jiang, Tao Yang, Mingsong Lv and Wang Yi

A locking protocol is an essential component of resource management in real-time systems, coordinating mutually exclusive accesses to shared resources by different tasks. OpenMP is a promising framework for multi-core real-time embedded systems and provides spin locks to protect shared resources. In this paper, we propose a resource model for analyzing OpenMP programs with spin locks. Based on this resource model, we develop a technique for analyzing the blocking time, which impacts the total workload. Notably, the resource model captures the programs' detailed resource access behavior, making our blocking analysis more accurate. Further, we derive a schedulability analysis for real-time OpenMP tasks whose shared resources are protected by spin locks. Experiments with realistic OpenMP programs are conducted to evaluate the performance of our method.

Predicting Performance Degradation on Adaptive Cache Replacement Policy

Yi Zhang, Ran Cui, Mingsong Lv, Chuanwen Li and Qingxu Deng

Adaptive Cache Replacement Policy (ACRP) has been implemented in recently released commercial multi-core processors. ACRP maintains two candidate cache replacement policies and dynamically employs whichever currently incurs fewer cache misses. ACRP can diminish overall cache misses, but at the same time it increases the performance interference between co-running applications and makes performance prediction much harder. Unfortunately, very little work has focused on the performance impact of this mechanism. In this paper, we first expose the performance variation problem caused by adaptive cache replacement policies. Second, we present Bubble-Bound, a low-overhead measurement-based method to estimate a program's performance variation caused by the dynamic adaptation of cache replacement policies. By using a stress program to characterize pressure and sensitivity, our method can predict a bound on the performance degradation between co-located applications and enable "safe" co-locations on processors with ACRP.

Making Inconsistent Components More Efficient For Hybrid B+Trees

Xiongxiong She, Chengliang Wang and Fenghua Tu

The emergence of non-volatile memories (NVMs) provides opportunities for efficiently manipulating and storing tree-based indexes. Traditional tree structures fail to take full advantage of NVMs due to considerable shifting of entries; these redundant write activities induce severe performance collapse and power consumption, which is unacceptable for embedded systems. Advanced schemes, such as NV-Tree, keep only leaf nodes in NVM to lighten the write burden. However, NV-Tree suffers from frequent rebuilding of its in-DRAM structures. In this paper, we develop a novel tree structure, Marionette-Tree, to address these issues. We follow the hybrid layout of non-leaf/leaf nodes and adopt bitmap-based leaf nodes to minimize NVM writes. We aggregate the pointers of internal nodes into a shadow array, allowing more keys to be recorded in an internal node, and we design a delayed-split-migration scheme to minimize the management overhead of the shadow array. Extensive evaluations demonstrate that Marionette-Tree achieves 1.38× and 5.73× insertion speedup over two state-of-the-art schemes.

Session Chair

Nan Guan (The Hong Kong Polytechnic University)

Session B2

Edge and Persistent Memory

Conference
1:30 PM — 2:30 PM HKT
Local
Dec 2 Wed, 12:30 AM — 1:30 AM EST

XOR-Net: An Efficient Computation Pipeline for Binary Neural Network Inference on Edge Devices

Shien Zhu, Luan H. K. Duong, and Weichen Liu

Accelerating the inference of Convolutional Neural Networks (CNNs) on edge devices is essential due to these devices' small memory size and poor computation capability. Network quantization methods such as XNOR-Net, Bi-Real-Net, and XNOR-Net++ reduce the memory usage of CNNs by binarizing them. They also simplify multiplication operations into bit-wise operations and obtain good speedup on edge devices. However, there are hidden redundancies in the computation pipeline of these methods, constraining the speedup of the binarized CNNs.
In this paper, we propose XOR-Net, an optimized computation pipeline for binary networks both with and without scaling factors. As XNOR is realized by the two instructions XOR and NOT on CPU/GPU platforms, XOR-Net avoids NOT operations by using XOR instead of XNOR, thus reducing the bit-wise operations in both kinds of binary convolution layers. For binary convolution with scaling factors, XOR-Net further rearranges the computation sequence of calculating and multiplying the scaling factors to reduce full-precision operations. Theoretical analysis shows that XOR-Net reduces the bit-wise operations by one-third compared with traditional binary convolution, and the full-precision operations by up to 40% compared with XNOR-Net. Experimental results show that our XOR-Net binary convolution without scaling factors achieves up to 135× speedup over parallel full-precision convolution while consuming no more than 0.8% of its energy. For binary convolution with scaling factors, XOR-Net is up to 17% faster and 19% more energy-efficient than XNOR-Net.
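The XOR-for-XNOR substitution rests on a small bit-trick that can be shown in a few lines. This sketch illustrates the identity only; it is not XOR-Net's pipeline:

```python
# For ±1 vectors packed into n-bit integers (bit 1 encodes +1, bit 0 encodes -1),
# XNOR-popcount counts matching positions. The same dot product needs only XOR:
# popcount(a ^ b) counts mismatches, so dot = matches - mismatches
#                                           = (n - mismatches) - mismatches
#                                           = n - 2 * popcount(a ^ b).

def binary_dot(a, b, n):
    """Dot product of two ±1 vectors given as n-bit integers, using XOR only."""
    return n - 2 * bin(a ^ b).count("1")
```

Since XNOR compiles to XOR followed by NOT on common CPU/GPU ISAs, computing the dot product from the XOR result directly removes one instruction per word, which is the redundancy the abstract refers to.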

Load Balance Awared Data Sharing Systems In Heterogeneous Edge Environment

Sheng Chen, Zheng Chen, Siyuan Gu, Baochao Chen, Junjie Xie and Deke Guo

Edge computing has become the de facto method for delay-sensitive applications, placing computation and storage resources at the edge of the network. A main responsibility of edge computing is to carry data from Cloud downlinks and terminal uplinks and to organize these data well on the edge side, which is the basis for subsequent analysis and processing. This raises the question of how data should be organized, stored, and retrieved at the edge. In response, several methods have been proposed to build data storage and retrieval services on the edge side, following three different approaches: structured, unstructured, and hybrid schemes. However, data storage and retrieval services for heterogeneous edge environments remain under-researched; in particular, load balancing is still not treated as an important design consideration when data is stored on the edge side. In this paper, we design and implement w-strategy, a load balancing approach that achieves appropriate load balance among heterogeneous edge nodes by using a weighted Voronoi diagram. Our solution utilizes the software-defined networking paradigm to support virtual-space-based distributed hash tables (DHTs) for data distribution. Evaluation results show that w-strategy achieves better load balancing among heterogeneous edge nodes than the existing methods GRED and Chord, improving the average under-utilization of resources by 20%.

Themis: Malicious Wear Detection and Defense for Persistent Memory File Systems

Wenbin Wang, Chaoshu Yang, Runyu Zhang, Shun Nie, Xianzhang Chen and Duo Liu

Persistent memory file systems can significantly improve performance by utilizing the advanced features of emerging Persistent Memories (PMs). Unfortunately, PMs have limited write endurance, a problem that the design of persistent memory file systems usually ignores. Consequently, write-intensive applications, and especially malicious wear-attack viruses, can quickly damage the underlying PM by calling the common interfaces of the file system to write a few cells of PM continuously, which seriously threatens the data reliability of file systems. Existing solutions to this problem within persistent memory file systems are not systematic and ignore the practically unlimited write endurance of DRAM. In this paper, we propose Themis, a malicious wear detection and defense mechanism for persistent memory file systems. Themis identifies malicious wear attacks according to the write traffic and the target lifespan of the PM. We then design a wear-leveling scheme and migrate the writes of malicious wear attackers into DRAM to improve the lifespan of PMs. We implement Themis in the Linux kernel based on NOVA, a state-of-the-art persistent memory file system. Compared with DWARM, the state-of-the-art wear-aware memory management technique, experimental results show that Themis improves the lifetime of PM by 5774× and performance by 1.13×.

Session Chair

Mingsong Lv (Northeastern University)

Session C1

Federated Learning and Reinforcement Learning

Conference
1:30 PM — 2:50 PM HKT
Local
Dec 2 Wed, 12:30 AM — 1:50 AM EST

Incentive Mechanism Design for Federated Learning: A Two-stage Stackelberg Game Approach

Guiliang Xiao, Mingjun Xiao, Guoju Gao, Sheng Zhang, Hui Zhao and Xiang Zou

Federated Learning (FL) is a newly emerging distributed ML paradigm in which a server coordinates multiple workers to cooperatively train a learning model using their private datasets, while ensuring that these datasets are not revealed to others. In this paper, we focus on incentive mechanism design for FL systems. Taking incentives into consideration, we first design two utility functions for the server and the workers, respectively. Then, we model the corresponding utility optimization problem as a two-stage Stackelberg game, viewing the server as the leader and the workers as followers. Next, we derive an optimal equilibrium solution for both stages of the game. Based on this solution, we design an incentive mechanism that ensures the server achieves optimal utility while stimulating workers to do their best to train the ML model. Finally, we conduct extensive simulations to demonstrate the strong performance of the proposed mechanism.
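The two-stage structure is solved by backward induction: first characterize the followers' best responses, then optimize the leader's choice given them. The toy model below illustrates only this solution pattern; the quadratic utilities and all parameters are hypothetical, not the paper's utility functions:

```python
# Backward-induction sketch of a two-stage Stackelberg game: the leader (server)
# announces a reward R; each follower (worker) best-responds with effort x
# maximizing R*x - x^2 (so x* = R/2); the leader then picks R maximizing its
# own utility given those anticipated responses.

def follower_best_effort(R):
    return R / 2.0  # argmax over x of R*x - x^2

def leader_utility(R, n_workers, value_per_effort):
    x = follower_best_effort(R)
    # Leader gains value from effort but pays the announced per-effort reward.
    return n_workers * (value_per_effort * x - R * x)

def stackelberg(n_workers=5, value_per_effort=4.0, grid=None):
    grid = grid or [i / 10 for i in range(0, 101)]
    return max(grid, key=lambda R: leader_utility(R, n_workers, value_per_effort))
```

In this toy model the leader's utility is n·(R/2)·(v − R), maximized at R = v/2, so a grid search should recover R = 2.0 for v = 4.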

Time Efficient Federated Learning with Semi-asynchronous Communication

Jiangshan Hao, Yanchao Zhao and Jiale Zhang

With the explosive growth of data generated by smart Internet of Things (IoT) devices, federated learning has been envisioned as a promising technique for providing distributed machine learning services while protecting the privacy of training data. However, conventional federated learning protocols have significant drawbacks in terms of efficiency and scalability. First, because federated learning uses a synchronous communication model and the computation capability of each device differs, stragglers can severely degrade efficiency. Second, synchronous communication lacks an effective client selection mechanism to make the model perform better in the early stages. Third, coordinating the communication of the various nodes to accelerate global convergence is also an open issue. To solve these problems, we propose a semi-asynchronous federated learning mechanism in which a data expansion method effectively reduces the stragglers that exist in both synchronous and asynchronous communication models. Moreover, we design a priority function that makes accuracy increase rapidly in the early stages. Experimental results demonstrate that our proposed method achieves higher accuracy and faster convergence.

Multi-agent Fault-tolerant Reinforcement Learning with Noisy Environments

Canhui Luo, Xuan Liu, Xinning Chen and Juan Luo

Multi-agent reinforcement learning systems are used to solve the problem of agents achieving specific goals through learned policies while interacting with the environment. Almost all existing multi-agent reinforcement learning methods assume that the agents' observations are accurate during training. They do not take into account that observations may be wrong due to the complexity of the actual environment or the existence of dishonest agents, which makes agent training difficult to succeed. In this paper, considering the limitations of traditional multi-agent algorithm frameworks in noisy environments, we propose a multi-agent fault-tolerant reinforcement learning (MAFTRL) algorithm. Our main idea is to establish each agent's own error detection mechanism and to design the information communication medium between agents. The error detection mechanism is based on an autoencoder, which calculates the credibility of each agent's observation and effectively reduces environmental noise. The communication medium, based on an attention mechanism, significantly improves the agents' ability to extract effective information. Experimental results show that our approach accurately detects erroneous agent observations and has good performance and strong robustness in both the traditional reliable environment and the noisy environment. Moreover, MAFTRL significantly outperforms traditional methods in the noisy environment.

Decentralized Exploration of a Structured Environment Based on Multi-agent Deep Reinforcement Learning

Dingjie He, Dawei Feng, Hongda Jia and Hui Liu

Multi-robot environment exploration is one of the widely discussed topics in the field of robotics and the foundation for many real-world robotic applications. Many decentralized methods (that is, methods without a centralized controller) have been proposed in the past decades. Most of them focus on improving collaboration efficiency by utilizing low-level heuristic information, such as distances to obstacles and robot positions. In contrast, when a human being makes decisions on a similar task, he/she exploits high-level knowledge, such as a building's common structure pattern. This paper proposes a novel distributed multi-robot exploration algorithm based on deep reinforcement learning (DME-DRL) for structured environments that enables robots to make decisions on the basis of this high-level knowledge. DME-DRL is a distributed algorithm that uses deep neural networks to extract the structural pattern of the environment, and it can work in scenarios with or without communication. The experimental results showed that this approach can decrease the travel distance by approximately 10.84% on average compared with traditional heuristic methods and can significantly reduce the communication cost in the exploration process.

Session Chair

Dawei Feng (National University of Defense Technology)

Session C2

Crowd Sourcing and Parallel Acceleration

Conference
1:30 PM — 2:50 PM HKT
Local
Dec 2 Wed, 12:30 AM — 1:50 AM EST

D2D-Enabled Reliable Data Collection for Mobile Crowd Sensing

Pengfei Wang, Zhen Yu, Chi Lin, Leyou Yang, Yaqing Hou and Qiang Zhang

With the increasingly powerful sensing capabilities of mobile devices, Mobile Crowd Sensing (MCS) systems need to collect ever larger volumes of sensing data from participants. Nevertheless, collecting such a large volume of data is costly for participants, base stations, and the MCS server. Even worse, some sensing data cannot satisfy the MCS sensing requirement due to low quality and are filtered out by the MCS server in the cloud. Inspired by the D2D technique in 5G networks, where mobile devices can communicate directly with the help of a nearby base station, we propose the Reliable Data Collection (RDC) algorithm to validate the generated sensing data at the device side. To be specific, the whole process is formulated as a Probability problem of Discovering Reliable sensing data (PDR) at the client side, and Expectation Maximization (EM) is leveraged to devise the algorithm. Finally, extensive simulations and a real-world use case are conducted to evaluate the performance of the RDC algorithm, and the results show that RDC outperforms the other two benchmarks in estimation accuracy and data collection cost.
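The abstract leaves the EM formulation abstract; a simplified, hypothetical sketch of EM-style reliability estimation for binary sensing reports (not the paper's actual PDR model) could be:

```python
import numpy as np

def em_reliability(reports, iters=20):
    """reports: (n_participants, n_items) matrix of 0/1 readings.
    Returns estimated per-participant reliability and per-item labels."""
    n, m = reports.shape
    rel = np.full(n, 0.8)                    # initial reliability guess
    for _ in range(iters):
        # E-step: belief that each item's true value is 1, as a
        # reliability-weighted vote in log-odds space.
        log_odds = np.zeros(m)
        for i in range(n):
            r = np.clip(rel[i], 1e-6, 1 - 1e-6)
            log_odds += np.log(r / (1 - r)) * (2 * reports[i] - 1)
        p_true = 1 / (1 + np.exp(-log_odds))
        # M-step: reliability = expected agreement with inferred labels.
        for i in range(n):
            rel[i] = np.mean(reports[i] * p_true + (1 - reports[i]) * (1 - p_true))
    return rel, (p_true > 0.5).astype(int)

# Three honest participants and one always-wrong one, reporting on 12 items.
truth = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
honest = np.tile(truth, (3, 1))
noisy = 1 - truth
rel, labels = em_reliability(np.vstack([honest, noisy]))
print(rel.round(2), labels)
```

The iteration recovers the true readings and assigns the unreliable participant a near-zero weight, which is the qualitative behavior a client-side reliable-data filter needs.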

Improving the Applicability of Visual Peer-to-Peer Navigation with Crowdsourcing

Erqun Dong, Jianzhe Liang, Zeyu Wang, Jingao Xu, Longfei Shangguan, Qiang Ma and Zheng Yang

Visual peer-to-peer navigation is a suitable solution for indoor navigation, as it relieves the labor of site surveys and eliminates infrastructure dependence. However, a major drawback hampers its application: the peer-to-peer mode suffers from a deficiency of paths in large indoor scenarios with multifarious places-of-interest. We propose a system with a crowdsourcing scheme that addresses this drawback by merging the paths of different leaders into a global map. To realize the idea, we further deal with the entailed challenges, namely the unidirectional disadvantage, scale ambiguity, and large computational overhead. We design a navigation strategy to solve the unidirectional problem and turn to VIO to tackle scale ambiguity. We devise a mobile-edge architecture to enable real-time navigation (30fps, 100ms end-to-end delay) and lighten the burden on smartphones (35% battery life for 2h35min) while assuring the accuracy of localization and map construction. Through experimental validations, we show that P2P navigation, previously relying on an abundance of independent paths, can enjoy a sufficiency of navigation paths with a crowdsourced global map. The experiments demonstrate a navigation success rate of 100% and a spatial offset of less than 3.2m, better than existing works.

Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution

Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga and Gerald M. Pao

Empirical Dynamic Modeling (EDM) is a nonlinear time series causal inference framework. The latest implementation of EDM, cppEDM, has only been used for small datasets due to computational cost. With the growth of data collection capabilities, there is a great need to identify causal relationships in large datasets. We present mpEDM, a parallel distributed implementation of EDM optimized for modern GPU-centric supercomputers. We improve the original algorithm to reduce redundant computation and optimize the implementation to fully utilize hardware resources such as GPUs and SIMD units. As a use case, we run mpEDM on the AI Bridging Cloud Infrastructure (ABCI) using datasets of an entire animal brain sampled at single neuron resolution to identify dynamical causation patterns across the brain. mpEDM is 1,530× faster than cppEDM, and a dataset containing 101,729 neurons was analyzed in 199 seconds on 512 nodes. This is the largest EDM causal inference achieved to date.

An Effective Design to Improve the Efficiency of DPUs on FPGA

Yutian Lei, Qingyong Deng, Saiqin Long, Shaohui Liu and Sangyoon Oh

Convolutional neural networks (CNNs) have been widely used in various complicated problems, such as image classification, object detection, and semantic segmentation. To accommodate diversified CNN structures, the deep learning processing unit (DPU) is designed as a general accelerator on field programmable gate arrays (FPGAs) to support various CNN layers, such as convolution, pooling, and activation. However, low DPU utilization and scheduling efficiency appear when the DPU is used for multitask applications built on CNN models. In this paper, an effective design including multi-core with different sizes (MCDS) and DPU Plus is proposed to improve the efficiency of DPU usage along the two dimensions of time and space. By increasing the number of DPU cores on an FPGA and the utilization of a single DPU core, the MCDS design can effectively improve the overall throughput with restricted on-chip resources. Furthermore, the DPU Plus design is proposed to improve the scheduling efficiency of DPUs by simultaneously implementing the DPU with other significant auxiliary modules of the application system on the same FPGA. Finally, a color space conversion module is implemented to cooperate with the DPU cores to verify its performance. The experiment shows that, compared with running entirely on the CPU, it achieves a 16.2x acceleration and increases the throughput of the entire system by 3.0x.

Session Chair

Shigeng Zhang (Central South University)

Session C3

Localization and Cross-technology Communication

Conference
2:50 PM — 4:10 PM HKT
Local
Dec 2 Wed, 1:50 AM — 3:10 AM EST

Lightweight Mobile Devices Indoor Location Based on Image Database

Ran Gao, Yanchao Zhao and Maoxing Tang

Among the numerous indoor localization technologies, the image-based solution has great advantages in its convenient access from smartphones and its infrastructure-less deployment. However, image-based localization also suffers from two key disadvantages, which hinder universal application. Firstly, it requires a large amount of computing and storage resources, which is difficult to achieve on a mobile device, while a cloud-based scheme incurs unacceptable delay. Secondly, although this solution doesn't require infrastructure, it still suffers from labor-intensive image-database construction and updates. To overcome these limitations, we propose an image-based indoor localization method featuring real-time localization and labor-less image database updates. This method mainly innovates in two aspects. First, we propose a mobile-device-compatible image database compression framework, which enables real-time and accurate on-device image search even in a large scenario. Our localization method achieves resource efficiency (in terms of storage and processing) by keeping only image feature vectors and employing an efficient k-means tree to search for the best-matched image. Second, to achieve labor-less image database updating, we mainly add high-quality and informative query images into the database. These query images can compensate for missing information or changed scenery in an up-to-date manner. We conduct real experiments on the Android platform to verify the feasibility and performance of the localization method. Experimental results show that our method has good accuracy (90% of location errors are within 1.5m) and high real-time performance (average location delay is less than 0.5s).
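As a rough illustration of feature-vector-only localization, here is a brute-force cosine-similarity search standing in for the paper's k-means tree (the vectors and function names are invented for the example):

```python
import numpy as np

def build_database(features):
    """Store only L2-normalized feature vectors (no raw images)."""
    F = np.asarray(features, dtype=float)
    return F / np.linalg.norm(F, axis=1, keepdims=True)

def locate(db, query):
    """Return the index of the best-matching database image."""
    q = query / np.linalg.norm(query)
    return int(np.argmax(db @ q))    # cosine-similarity search

# Toy 4-D "image features" for three reference locations.
db = build_database([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]])
print(locate(db, np.array([0.1, 0.9, 0.05, 0.0])))  # → 1
```

A k-means tree replaces the linear scan with a hierarchical search, which is what makes the lookup feasible on-device for large databases.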

A Dynamic Escape Route Planning Method for Indoor Multi-floor Buildings Based on Real-time Fire Situation Awareness

Chun Wang, Juan Luo, Cuijun Zhang and Xuan Liu

The complicated interior structure of high-rise buildings brings great difficulties to fire escape route planning. Existing two-dimensional (2D) emergency evacuation models are used to solve the problem of guidance and rescue for fire responders. However, these models face a bottleneck of low security due to limited environmental information and no consideration of the behavior of trapped personnel. In this paper, we propose DERP, a dynamic escape route planning method that achieves accurate disaster site avoidance and safe route planning by considering fire situation awareness in a smart building. DERP is enabled by two novel designs. First, a three-dimensional (3D) fire information model is constructed by cellular automata considering the overall indoor 3D topological structure, the fire situation, and the crowd distribution. Second, a multiple-constraint 3D indoor emergency escape route planning algorithm is designed based on a 3D path safety function. The experimental results show that DERP can plan and adjust the escape route timely and dynamically, thus increasing the escape probability of the trapped people.

Mitigating Cross-Technology Interference in Heterogeneous Wireless Networks based on Deep Learning

Weidong Zheng, Junmei Yao and Kaishun Wu

With the prosperity of the Internet of Things, a large number of heterogeneous wireless devices share the same unlicensed spectrum, leading to severe cross-technology interference (CTI). In particular, the transmission power asymmetry of heterogeneous devices further deteriorates this problem, leaving low-power devices prohibited from data transmission and starved. This paper proposes an enhanced CCA (E-CCA) mechanism to mitigate CTI, so as to improve the performance and fairness among heterogeneous networks. E-CCA contains a signal identification design based on deep learning to identify the signal type within a tolerable time duration, and a CCA adaptation mechanism based on the signal type to avoid CTI. As a result, ZigBee devices can compete for the channel with WiFi devices more fairly, and network performance can be improved accordingly. We set up a testbed based on TelosB, a commercial ZigBee platform, and USRP N210, which can be used as the WiFi platform. With the signals collected through USRP N210, over 99.9% signal identification accuracy can be achieved even when the signal duration is tens of microseconds. Simulation results based on NS-3 show that E-CCA can increase ZigBee performance dramatically with little throughput degradation for WiFi.

Accelerating PageRank in Shared-Memory for Efficient Social Network Graph Analytics

Baofu Huang, Zhidan Liu and Kaishun Wu

PageRank has wide applications in online social networks and serves as an important benchmark to examine graph processing frameworks. Many efforts have been made to improve the computation efficiency of PageRank on shared-memory platforms, where a single machine can be sufficiently powerful to handle a large-scale graph. Existing methods, however, still suffer from synchronization issues and irregular memory accesses, which deteriorate their overall performance. In this paper, we present an accelerated parallel PageRank computation approach, named APPR. By investigating the characteristics of parallel PageRank computation and the degree distributions of social network graphs, APPR employs a series of optimization techniques to improve the efficiency of PageRank computation. Specifically, a destination-centric graph partitioning scheme is designed to avoid synchronization issues when concurrently updating common vertex data. By exploiting the power-law structure of social network graphs, APPR can intelligently schedule the computations of vertices to save computing operations. The vertex messages are adjusted by APPR for transmission to further improve the locality of memory accesses. Empirical evaluations are performed on a set of large real-world graphs. Experimental results show that APPR significantly outperforms the state-of-the-art methods, with on average a 2.4x speedup in execution time and a 16.4x reduction in DRAM communication.
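For reference, the baseline computation that APPR accelerates is plain power-iteration PageRank, which can be sketched as:

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency list {node: [out-neighbors]}."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        nxt = np.full(n, (1 - d) / n)        # teleport term
        for v, outs in adj.items():
            if outs:
                share = d * r[idx[v]] / len(outs)
                for u in outs:
                    nxt[idx[u]] += share     # push rank along out-edges
            else:                            # dangling node: spread uniformly
                nxt += d * r[idx[v]] / n
        r = nxt
    return dict(zip(nodes, r))

# Star graph: every node links to node 0, which should rank highest.
ranks = pagerank({0: [], 1: [0], 2: [0], 3: [0]})
print(max(ranks, key=ranks.get))  # → 0
```

The inner push loop is exactly where concurrent updates to a shared destination vertex collide; APPR's destination-centric partitioning assigns each destination to one thread so those updates need no synchronization.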

Session Chair

Zhidan Liu (Shenzhen University)

Session C4

Scheduling in Edge Environment

Conference
2:50 PM — 4:10 PM HKT
Local
Dec 2 Wed, 1:50 AM — 3:10 AM EST

Using Configuration Semantic Features and Machine Learning Algorithms to Predict Build Result in Cloud-Based Container Environment

Yiwen Wu, Yang Zhang, Bo Ding, Tao Wang and Huaimin Wang

Container technologies are being widely used in large-scale production cloud environments, of which Docker has become the de-facto industry standard. In practice, Docker builds often break, and considerable effort is put into troubleshooting broken builds. Prior studies have evaluated the rate at which builds fail in large organizations. However, there is still a lack of early warning methods for predicting the Docker build result before the build starts. This paper provides a first attempt with an automatic method named PDBR. It uses configuration semantic features extracted via ASTs and machine learning algorithms to predict the build result in a cloud-based container environment. Evaluation experiments based on more than 36,000 collected Docker builds show that PDBR achieves 73.45%-91.92% in F1 and 29.72%-72.16% in AUC. We also demonstrate that different ML classifiers have significant and large effects on PDBR's AUC performance.

Joint Service Placement and Computation Offloading in Mobile Edge Computing: An Auction-based Approach

Lei Zhang, Zhihao Qu, Baoliu Ye and Bin Tang

Emerging applications, e.g., virtual reality, online games, and the Internet of Vehicles, have computation-intensive and latency-sensitive requirements. Mobile edge computing (MEC) is a powerful paradigm that significantly improves the quality of service (QoS) of these applications by offloading computation and deploying services at the network edge. Existing works on service placement in MEC usually ignore the different QoS requirements among service providers (SPs), which are common in many applications; for example, online games require extremely low latency while online video requires extremely large bandwidth. Considering the competitive relationship among SPs, we propose an auction-based resource allocation mechanism. We formulate the problem as a social welfare maximization problem to maximize the effectiveness of allocated resources while maintaining economic robustness. According to our theoretical analysis, this problem is NP-hard, and thus it is practically impossible to derive the optimal solution. To tackle this, we design a multiple rounds of iterative auctions mechanism (MRIAM), which divides resources into blocks and allocates them through multiple rounds of auctions. Finally, we conduct extensive experiments and demonstrate that our auction-based mechanism is effective in resource allocation and robust in economics.

Multi-user Edge-assisted Video Analytics Task Offloading Game based on Deep Reinforcement Learning

Yu Chen, Sheng Zhang, Mingjun Xiao, Zhuzhong Qian, Jie Wu and Sanglu Lu

With the development of deep learning, artificial intelligence applications and services have boomed in recent years, including recommendation systems, personal assistants, and video analytics. Similar to other services in the edge computing environment, artificial intelligence computing tasks are pushed to the network edge. In this paper, we consider the multi-user edge-assisted video analytics task offloading (MEVAO) problem, where users have video analytics tasks with various accuracy requirements. All users independently choose their accuracy decisions, satisfying their accuracy requirements, and offload the video data to the edge server. With a utility function designed based on the features of video analytics, we model MEVAO as a game theory problem and achieve the Nash equilibrium. For flexibility in making accuracy decisions under different circumstances, a deep reinforcement learning approach is applied to our problem. Our proposed design performs much better than other approaches in extensive simulations.

Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding

Lipeng Wang, Qiong Luo and Shengen Yan

In computer vision deep learning (DL) tasks, most input image datasets are stored in the JPEG format. These JPEG datasets need to be decoded before DL tasks are performed on them. We observe two problems in current JPEG decoding procedures for DL tasks: (1) the decoding of image entropy data in the decoder is performed sequentially, and this sequential decoding repeats with the DL iterations, which takes significant time; (2) current parallel decoding methods under-utilize the massive hardware threads on GPUs. To reduce the image decoding time, we introduce a pre-scan mechanism to avoid repeated image scanning in DL tasks. Our pre-scan generates boundary markers for entropy data so that the decoding can be performed in parallel. To cooperate with existing dataset storage and caching systems, we propose two modes of the pre-scan mechanism: a compatible mode and a fast mode. The compatible mode does not change the image file structure, so pre-scanned files can be stored back to disk for subsequent DL tasks. In comparison, the fast mode crafts a JPEG image into a binary format suitable for parallel decoding, which can be processed directly on the GPU. Since the GPU has thousands of hardware threads, we propose a fine-grained parallel decoding method on the pre-scanned dataset. The fine-grained parallelism utilizes the GPU effectively and achieves speedups of around 1.5x over existing GPU-assisted image decoding libraries on real-world DL tasks.

Session Chair

Qingyong Deng (Xiangtan University)

Session C5

Application and Security

Conference
2:50 PM — 4:10 PM HKT
Local
Dec 2 Wed, 1:50 AM — 3:10 AM EST

Scheduling Rechargeable UAVs for Long Time Barrier Coverage

Zhouqing Han, Xiaojun Zhu and Lijie Xu

We consider barrier coverage applications where a set of UAVs is deployed to monitor whether intruders pass through a line, i.e., the barrier. Due to the limited energy supply of UAVs, a charging pile is used to recharge them. The problem is to place UAVs on top of the barrier and schedule them to the charging pile such that the barrier is seamlessly covered and the total number of UAVs is minimized. We decompose the problem into subproblems by dividing the barrier into disjoint subsegments and covering each subsegment independently. We prove a theoretical lower bound on the minimum number of UAVs required to cover the barrier forever. We then propose two scheduling strategies. In the first strategy, only fully recharged backup UAVs are scheduled to take over UAVs running out of energy. If there are enough UAVs, this strategy can cover the barrier forever. The second strategy is proposed to deal with the situation where the number of UAVs is insufficient. Under this strategy, if no backup UAV is fully recharged, the one with the most battery energy is selected to take over the monitoring task. When the number of UAVs is insufficient, the barrier can still be covered for a long time. We analytically derive the number of UAVs required by the first strategy and the monitoring duration of the second strategy in the case of insufficient UAVs. Simulations verify the effectiveness of the proposed solutions.

Use of Genetic Programming Operators in Data Replication and Fault Tolerance

Syed Mohtashim Abbas Bokhari and Oliver Theel

Distributed systems are a necessity of the current times to balance workload, since providing highly accessible data objects is of utmost importance. Faults hinder the availability of the data, thereby leading systems to fail. In this regard, data replication in distributed systems is a means to mask failures and mitigate any such possible hindrances in the availability of the data. This replicated behavior is controlled by data replication strategies, but there are numerous scenarios reflecting different trade-offs between several quality metrics. This demands designing new replication strategies optimized for the given scenarios, which may otherwise be left unaddressed. This research therefore uses an automatic mechanism based on genetic programming to construct new optimized replication strategies that were unknown up to now. The mechanism uses a so-called voting structure of directed acyclic graphs (each representing a computer program) as a unified representation of replication strategies. These structures are interpreted by our general algorithm at run-time in order to derive the respective quorums that eventually manage replicated objects. In particular, the research demonstrates the usefulness of various genetic operators through their instances, exploiting the heterogeneity between existing strategies and thereby creating innovative strategies flexibly. The mechanism creates new hybrid strategies and evolves them over several generations to optimize them while maintaining the consistency (validity) of the solutions. Our approach is very effective and extremely flexible: it offers competitive results with respect to contemporary strategies and generates novel strategies even with only slight use of the relevant genetic operators.

Multiple Balance Subsets Stacking for Imbalanced Healthcare Datasets

Yachao Shao, Tao Zhao, Xiaoning Wang, Xiaofeng Zou and Xiaoming Fu

Accurate prediction is highly important for clinical decision making and early treatment. In this paper, we study the imbalanced data problem in prediction, a key challenge in the healthcare area. Imbalanced datasets bias classifiers towards the majority class, leading to unsatisfactory classification performance on the minority class, which is known as the imbalance problem. Existing imbalance learning methods may suffer from issues such as information loss, overfitting, and high training time cost. To tackle these issues, we propose a novel ensemble learning method called Multiple bAlance Subsets Stacking (MASS) that exploits a multiple balanced subsets construction strategy. Furthermore, we improve MASS by introducing parallelism (Parallel MASS) to reduce the training time cost. We evaluate MASS on three real-world healthcare datasets, and experimental results demonstrate that its prediction performance outperforms the state-of-the-art methods in terms of AUC, F1-score, and MCC. Through the speedup analysis, Parallel MASS reduces the training time cost greatly on large datasets, and its speedup increases as the data size grows.
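A heavily simplified sketch of the multiple-balanced-subsets idea, with nearest-centroid base learners and score averaging standing in for MASS's actual base classifiers and stacking meta-learner, might look like:

```python
import numpy as np

def balanced_subsets(X_maj, X_min, rng):
    """Split the majority class into minority-sized chunks; pair each
    chunk with the full minority class to form balanced training sets."""
    idx = rng.permutation(len(X_maj))
    k = max(1, len(X_maj) // len(X_min))
    return [(X_maj[chunk], X_min) for chunk in np.array_split(idx, k)]

def centroid_score(X_maj, X_min, x):
    """Base learner: signed centroid-distance score (> 0 means minority)."""
    return (np.linalg.norm(x - X_maj.mean(axis=0))
            - np.linalg.norm(x - X_min.mean(axis=0)))

def predict(subsets, x):
    """Aggregate base-learner scores (stand-in for a stacking meta-learner)."""
    scores = [centroid_score(Xm, Xn, x) for Xm, Xn in subsets]
    return int(np.mean(scores) > 0)          # 1 = minority class

rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, size=(300, 2))   # majority cluster around (0, 0)
X_min = rng.normal(loc=4.0, size=(30, 2))    # minority cluster around (4, 4)
subsets = balanced_subsets(X_maj, X_min, rng)
print(predict(subsets, np.array([4.0, 4.0])), predict(subsets, np.array([0.0, 0.0])))
```

Unlike undersampling, every majority sample is used by exactly one base learner, so no information is discarded; this is the property MASS exploits.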

Securing App Behaviors in Smart Home: A Human-App Interaction Perspective

Jinlei Li, Yan Meng, Lu Zhou and Haojin Zhu

Smart home has become a mainstream lifestyle due to the maturity of IoT platforms and the popularity of smart devices. While offering great convenience and entertainment, smart homes suffer from malicious attacks that inject improper commands and actions into home devices, which may breach the user's safety and privacy. Traditional solutions mainly focus on generating security policies that rely on app analysis to constrain apps' behaviors. However, these policies lack the flexibility to adapt to the highly dynamic smart home system. We need to consider not only the app behaviors but also the user behaviors when enforcing an appropriate security policy. In this study, we propose WIPOLICY, a cross-layer security enforcement system for smart homes that monitors the behaviors of both apps and users. The key novelty of WIPOLICY is incorporating user activity recognition via physical-layer wireless signals into the definition and enforcement of security policies to constrain app behavior. We implement WIPOLICY on the Samsung SmartThings platform with 187 SmartApps, and 24 behavior policies are defined and enforced. The case study demonstrates the effectiveness of WIPOLICY in thwarting apps' misbehavior.

Session Chair

Pengfei Wang (Dalian University of Technology)

Session D1

Edge-Cloud Performance

Conference
4:20 PM — 5:40 PM HKT
Local
Dec 2 Wed, 3:20 AM — 4:40 AM EST

Task Offloading in Trusted Execution Environment empowered Edge Computing

Yuepeng Li, Deze Zeng, Lin Gu, Andong Zhu and Quan Chen

To tackle the scarcity of computation resources on end devices, task offloading is developed to reduce task completion time and improve the Quality-of-Service (QoS). Edge computing facilitates such offloading by provisioning resources in the proximity of end devices. Nowadays, many tasks on end devices have an urgent demand for a secure execution environment. To address this problem, we introduce the trusted execution environment (TEE) to empower edge computing for secure task offloading. To exploit TEE, the offloading process should be redesigned with the introduction of data encryption and decryption. This makes traditional offloading optimization policies fail to apply directly. To address this issue, we are motivated to incorporate data encryption and decryption into the offloading scheduling algorithm. In particular, we propose a Customized List Scheduling based Offloading (CLSO) algorithm, aiming to minimize the total completion time under energy budget limitations on the end devices. The experiment results show that our approximation algorithm can effectively reduce the total completion time and significantly outperforms existing state-of-the-art offloading strategies.

Gecko: Guaranteeing Latency SLO on a Multi-Tenant Distributed Storage System

Zhenyu Leng, Dejun Jiang, Liuying Ma and Jin Xiong

Meeting tail latency Service Level Objectives (SLOs) as well as achieving high resource utilization is important to distributed storage systems. Recent works adopt strict priority scheduling or constant rate limiting to provide SLO guarantees but cause resource under-utilization. To address this issue, we first analyze the relationship between workload burst and latency SLO. Based on burst patterns and latency SLOs, we classify tenants into two categories: Postponement-Tolerable tenants and Postponement-Intolerable tenants. We then explore the opportunity to improve resource utilization by carefully allocating resources to each tenant type. We design a Rate-Limiting-Priority scheduling algorithm to limit the impact of high-priority tenants on low-priority ones. Meanwhile, we propose a Postponement-Aware scheduling algorithm which allows Postponement-Intolerable tenants to preempt system capacity from Postponement-Tolerable tenants. This helps to increase resource utilization. We propose a latency SLO guarantee framework, Gecko. Gecko guarantees multi-tenant latency SLOs by combining the two proposed scheduling algorithms together with an admission control strategy. We evaluate Gecko with real production traces, and the results show that Gecko admits 44% more tenants on average than state-of-the-art techniques while guaranteeing latency SLOs.

A Learning-based Dynamic Load Balancing Approach for Microservice Systems in Multi-cloud Environment

Jieqi Cui, Pengfei Chen and Guangba Yu

Multi-cloud environments have become common since companies seek to prevent cloud vendor lock-in for security and cost concerns. Meanwhile, the microservice architecture is often adopted for its flexibility. Combining multi-cloud with microservices, the problem arises of routing requests among all possible microservice instances in a multi-cloud environment. This paper presents a learning-based approach to route requests in order to balance the load. In our approach, the performance of a microservice is modeled explicitly through machine learning models. The model can derive the response time from the request volume, route decision, and other cloud metrics. The balanced route decision is then obtained by optimizing the model with Bayesian Optimization. With this approach, the request route decision can adjust to dynamic runtime metrics instead of remaining static across different circumstances. Explicit performance modeling avoids searching on an actual microservice system, which is time-consuming. Experiments show that our approach reduces average response time by at least 10%.
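As an illustration of the optimization step, here is a toy version that replaces the learned performance model with an analytic utilization model and Bayesian Optimization with a grid search (both substitutions are ours, not the paper's):

```python
import numpy as np

def response_time(frac_a, volume, cap_a, cap_b):
    """Toy performance model (stand-in for the learned model): response
    time grows with per-cloud utilization; the slower cloud dominates."""
    return max(frac_a * volume / cap_a, (1 - frac_a) * volume / cap_b)

def best_route(volume, cap_a, cap_b, grid=1001):
    """Search the traffic split that minimizes the modeled response time
    (a grid-search stand-in for Bayesian Optimization)."""
    fracs = np.linspace(0, 1, grid)
    times = [response_time(f, volume, cap_a, cap_b) for f in fracs]
    return float(fracs[int(np.argmin(times))])

# Cloud A has 3x the capacity of cloud B, so it should take ~75% of traffic.
f = best_route(volume=1000.0, cap_a=300.0, cap_b=100.0)
print(round(f, 2))  # → 0.75
```

The point of optimizing against a model rather than the live system is exactly what the search loop shows: every candidate split is evaluated in microseconds instead of against real traffic.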

Enhancing Availability for the MEC Service: CVaR-based Computation Offloading

Shengli Pan, Zhiyong Zhang, Tao Xue and Guangmin Hu

Mobile Edge Computing (MEC) enables mobile users to offload their computation loads to nearby edge servers and is being integrated into the 5G architecture to support a variety of low-latency applications and services. However, an edge server might soon be overloaded when its computation resources are heavily requested and would then fail to process all of its received computation loads in time. Unlike most existing schemes that ingeniously instruct the overloaded edge server to transfer computation loads to the remote cloud, we make use of the spare computation resources of other local edge servers while explicitly taking the risk of network link failures into account. We measure such link failure risks with the financial risk management metric of Conditional Value-at-Risk (CVaR) and constrain it in the offloading decisions using a Minimum Cost Flow (MCF) problem formulation. Numerical results validate the enhancement of the MEC service's availability by our risk-aware offloading scheme.
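For readers unfamiliar with the metric, CVaR at level α is the mean loss in the worst (1 − α) tail of the loss distribution; a minimal sketch over sampled losses:

```python
import numpy as np

def cvar(losses, alpha=0.95):
    """Conditional Value-at-Risk: mean loss in the worst (1 - alpha) tail."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # worst first
    k = max(1, int(round((1 - alpha) * len(losses))))
    return losses[:k].mean()

# Example losses (e.g., delays over a possibly failing link) of 100 decisions.
losses = np.arange(1.0, 101.0)
print(cvar(losses, alpha=0.95))  # mean of the worst 5 losses → 98.0
```

Unlike plain Value-at-Risk, CVaR accounts for how bad the tail outcomes are on average, which is why it is a natural constraint for link-failure-aware offloading.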

Session Chair

Yufeng Zhan (Beijing Institute of Technology)

Session D2

Performance of Distributed Systems

Conference
4:20 PM — 5:20 PM HKT
Local
Dec 2 Wed, 3:20 AM — 4:20 AM EST

Intermediate Value Size Aware Coded MapReduce

Yamei Dong, Bin Tang, Baoliu Ye, Zhihao Qu and Sanglu Lu

MapReduce is a commonly used framework for parallel processing of data-intensive tasks, but its performance usually suffers from the heavy communication load incurred by the shuffling of intermediate values (IVs) among computing servers. Recently, the Coded MapReduce framework was proposed, which uses a coding scheme named coded distributed computing (CDC) to trade the communication load in MapReduce for extra computation. CDC can achieve the optimal computation-communication tradeoff when all the IVs have the same size. However, in many practical applications, the sizes of IVs can vary over a large range, leading to inferior performance. In this paper, we introduce a generalized CDC scheme which takes the sizes of IVs into account and then propose a combinatorial optimization problem aiming to minimize the communication load when the computation load is fixed. We show that the problem is NP-hard, and further propose a very efficient algorithm which achieves an approximation ratio of 2. Experiments conducted on Alibaba Cloud show that, compared to the original CDC scheme, our proposed IV size aware approach can significantly reduce the communication load and achieve a lower total execution time.

A Customized Reinforcement Learning based Binary Offloading in Edge Cloud

Yuepeng Li, Lvhao Chen, Deze Zeng and Lin Gu

To address the scarcity of computation resources on end devices, task offloading has been developed to reduce task completion time and improve Quality-of-Service (QoS). The edge cloud facilitates such offloading by provisioning resources in the proximity of end devices. Modern applications are usually deployed as a chain of subtasks (e.g., microservices), to which a special offloading strategy, referred to as binary offloading, can be applied. Binary offloading divides the chain into two parts, executed on the end device and the edge cloud, respectively. The offloading point in the chain is therefore critical to the QoS in terms of task completion time. Considering the system dynamics and algorithm sensitivity, we apply Q-learning to this problem. To deal with the late feedback problem, we propose a reward rewind match strategy to customize Q-learning. Trace-driven simulation results show that our customized Q-learning based approach achieves a significant reduction in total execution time, outperforming traditional offloading strategies and non-customized Q-learning.
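The split-point choice can be sketched as single-state (bandit-style) Q-learning: the reward is the negated completion time of running the first k subtasks locally and offloading the rest. This is a generic sketch under assumed linear local and convex offload time curves; the paper's reward rewind match customization and real traces are not modeled here.

```python
import random

random.seed(0)

N = 6                      # subtasks in the chain; action k = run first k locally
LOCAL = [0.0, 1, 2, 3, 4, 5, 6]                 # hypothetical local time, linear in k
OFFLOAD = [9.0, 6.5, 4.8, 3.9, 3.5, 3.4, 3.6]   # hypothetical convex offload time

def makespan(k):
    # Small noise models system dynamics (bandwidth / server load fluctuation)
    return LOCAL[k] + OFFLOAD[k] + random.gauss(0, 0.1)

Q = [0.0] * (N + 1)
alpha, eps = 0.1, 0.2
for _ in range(5000):
    if random.random() < eps:                     # epsilon-greedy exploration
        k = random.randrange(N + 1)
    else:
        k = max(range(N + 1), key=Q.__getitem__)
    reward = -makespan(k)                         # shorter completion -> higher reward
    Q[k] += alpha * (reward - Q[k])               # single-state Q-update

best = max(range(N + 1), key=Q.__getitem__)
print(best)   # converges near the minimum of LOCAL[k] + OFFLOAD[k]
```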

Optimal Use Of The TCP/IP Stack in User-space Storage Applications with ADQ feature in NIC

Ziye Yang, Ben Walker, James R Harris, Yadong Li, and Gang Cao

For storage applications based on TCP/IP, the performance of the TCP/IP stack is often the dominant driver of the application's overall performance. In this paper, we introduce Tbooster, a library for TCP/IP based storage applications that optimally leverages the Linux kernel's TCP/IP stack and socket interface to improve performance. The library allows for efficient grouping of connections onto threads and asynchronous, poll-mode operation, scaling to a massive number of TCP connections on each thread. Tbooster avoids making expensive system calls by batching and merging operations into a single operation per connection for each poll loop. Further, Tbooster leverages the Application Device Queues (ADQ) feature available in some Network Interface Cards to steer incoming data to the correct NIC receive queue. This avoids expensive coordination and message passing within the kernel when handling incoming data, and especially improves outlier tail latencies of requests. Compared with a more standard usage of the Linux kernel TCP/IP stack, Tbooster significantly improves storage I/O per second (e.g., 9% to 22.7% IOPS improvement for an iSCSI target at 4KiB I/O size, and 36.3% to 59.4% IOPS improvement for an NVMe-oF TCP target at 8KiB I/O size). Moreover, when using the ADQ feature of Intel's 100GbE NIC, Tbooster reduces the 99.99th-percentile long tail latency by 30% on average under heavy NVMe-oF TCP workloads.
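The batching idea itself is simple and can be shown in a few lines (a generic sketch of merging queued writes, our toy rather than Tbooster's actual C implementation):

```python
import socket

# Instead of one send() system call per queued message, merge everything
# pending for a connection into a single send() per poll-loop iteration.
a, b = socket.socketpair()

pending = [b"hdr1|payload1;", b"hdr2|payload2;", b"hdr3|payload3;"]

# One poll-loop iteration: three queued writes become one system call
merged = b"".join(pending)
a.sendall(merged)
pending.clear()

data = b.recv(4096)
print(data == merged)  # True: the peer sees the same bytes, with a third of the calls
a.close(); b.close()
```

On a stream socket the peer cannot tell the difference, so batching trades nothing away except per-call latency inside one poll iteration.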

Session Chair

Shengli Pan (China University of Geosciences (Wuhan))

Session E1

Cyber Physical Security

Conference
4:20 PM — 5:40 PM HKT
Local
Dec 2 Wed, 3:20 AM — 4:40 AM EST

SIC^2: Securing Microcontroller Based IoT Devices with Low-cost Crypto Coprocessors

Bryan Pearson, Cliff Zou, Yue Zhang, Zhen Ling and Xinwen Fu

In this paper, we explore the use of microcontrollers (MCUs) and crypto coprocessors to secure IoT applications, and show how developers may implement a low-cost platform that protects private keys against software attacks. We first demonstrate the plausibility of format string attacks on the ESP32, a popular MCU from Espressif that uses the Harvard architecture; such attacks can be used to remotely steal private keys hard-coded in the firmware. We then present SIC2 (Securing IoT with Crypto Coprocessors), a framework for secure key provisioning that protects end users' private keys from both software attacks and untrustworthy manufacturers. As a proof of concept, we pair the ESP32 with the low-cost ATECC608A cryptographic coprocessor by Microchip and connect to Amazon Web Services (AWS) and Amazon Elastic Compute Cloud (EC2) using a hardware-protected private key, which provides the security features of TLS communication, including authentication, encryption, and integrity. We have developed a prototype and performed extensive experiments showing that the ATECC608A crypto chip can reduce the TLS handshake time with the remote server by as much as 82% and lower the total energy consumption of the system by up to 70%. Our results indicate that securing IoT with crypto coprocessors is a practical solution for low-cost MCU based IoT devices.
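The paper's exploit targets C printf-style format strings on the ESP32. As a language-portable analogy (our own toy, with hypothetical names), Python's `str.format` shows the same failure mode: an attacker-controlled template can traverse object internals and leak a secret that is never exposed directly.

```python
# Hypothetical hard-coded secret, standing in for a private key in firmware
HARDCODED_KEY = "0xDEADBEEFCAFE"

class Device:
    def __init__(self):
        self.name = "sensor-01"

# A benign template would be "{0.name}"; a malicious one walks the bound
# method's __globals__ to reach module-level secrets, much like "%x" probes
# walk the stack in a C format string attack.
malicious = "{0.__init__.__globals__[HARDCODED_KEY]}"
print(malicious.format(Device()))  # leaks the key despite never naming it directly
```

The mitigation in both worlds is the same: never feed untrusted input to a formatting function as the template, and (as SIC2 argues) keep keys out of software-readable memory entirely.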

Intelligent detection algorithm against UAVs' GPS spoofing attack

Shenqing Wang, Jiang Wang, Chunhua Su, and Xinshu Ma

Unmanned Aerial Vehicle (UAV) technology is increasingly widely used for civil and military information acquisition, and GPS plays the most critical role in UAVs' navigation and positioning. However, since the communication channel of GPS signals is open, attackers can disguise forged signals as real GPS signals to launch GPS spoofing attacks on civilian UAVs. Existing detection schemes for GPS spoofing attacks can be divided into three categories, based respectively on encryption and digital signatures, on characteristics of the GPS signal, and on various external characteristics of UAVs. However, these methods suffer from problems such as low computing efficiency, difficulty in equipment upgrading, and limited application scenarios. To solve these problems, we propose a new GPS spoofing attack detection method based on Long Short-Term Memory (LSTM), a machine learning algorithm. To improve the detection ratio, in addition to the learning algorithm we let the UAV fly along a path of a specific shape, so that GPS spoofing attacks can be detected accurately. To our knowledge, this is the first time machine learning has been used to detect GPS spoofing attacks. Our algorithm can detect GPS spoofing attacks accurately and quickly. This paper describes the proposed algorithm in detail, and the corresponding experiments are carried out in a simulation environment. The experimental results show that our method can quickly and accurately detect UAV GPS spoofing attacks without requiring upgrades to existing equipment.
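The intuition behind the shape-based check can be sketched geometrically (our own simplification, not the paper's LSTM): command the UAV to fly a known shape, then flag spoofing when the reported GPS fixes deviate from that shape by more than a threshold.

```python
import math

def deviation_from_circle(fixes, cx, cy, radius):
    """Worst-case distance of the reported fixes from the commanded circle."""
    return max(abs(math.hypot(x - cx, y - cy) - radius) for x, y in fixes)

def is_spoofed(fixes, cx=0.0, cy=0.0, radius=10.0, threshold=1.0):
    return deviation_from_circle(fixes, cx, cy, radius) > threshold

# Honest fixes lie on the commanded circle (toy numbers)
honest = [(10 * math.cos(t / 10), 10 * math.sin(t / 10)) for t in range(60)]
# A spoofer dragging the victim eastward adds a growing bias to the fixes
spoofed = [(x + 0.2 * t, y) for t, (x, y) in enumerate(honest)]

print(is_spoofed(honest), is_spoofed(spoofed))  # False True
```

An LSTM generalizes this idea by learning the expected temporal pattern of fixes instead of hard-coding one shape and threshold.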

An Efficient and Scalable Sparse Polynomial Multiplication Accelerator for LAC on FPGA

Jipeng Zhang, Zhe Liu, Hao Yang, Junhao Huang and Weibin Wu

LAC, a Ring-LWE based scheme, has been shortlisted for the second round evaluation of the National Institute of Standards and Technology Post-Quantum Cryptography (NIST-PQC) Standardization. FPGAs are widely used to design accelerators for cryptographic schemes, especially in resource-constrained scenarios such as IoT. Sparse Polynomial Multiplication (SPM) is the most compute-intensive routine in LAC, so an FPGA accelerator for SPM can significantly improve the performance of LAC. However, as far as we know, there is currently no work on a hardware implementation of SPM for LAC. The efficient and scalable SPM accelerator proposed in this paper fills this gap. More concretely, we first develop the Dual-For-Loop-Parallel (DFLP) technique to optimize the accelerator's parallel design, achieving a 2x performance improvement compared with previous works. Secondly, we design a hardware-friendly modular reduction algorithm for the modulus 251, which saves hardware resources while improving performance. We then conduct a detailed analysis and optimization of the pipeline design, achieving a frequency improvement of up to 34%. Finally, our design is scalable, and various performance-area trade-offs can be achieved through a parameter p. Our results demonstrate that the proposed design achieves a considerable performance improvement at moderate hardware area cost. For example, our medium-scale architecture for LAC-128 takes only 783 LUTs, 432 FFs, 5 BRAMs, and no DSP on an Artix-7 FPGA, and completes LAC's polynomial multiplication in 8512 cycles at a frequency of 202 MHz.
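What SPM computes can be shown in software (a reference sketch of the shift-and-add structure that DFLP parallelizes in hardware; the tiny degree below is illustrative, LAC's actual ring dimension is much larger): the sparse ternary factor is represented by the index sets of its +1 and −1 coefficients, so each nonzero becomes one negacyclic shift-and-add of the dense operand modulo 251.

```python
def spm(pos, neg, dense, q=251):
    """Sparse * dense in Z_q[x]/(x^n + 1). `pos` and `neg` are the index
    sets of the sparse factor's +1 and -1 coefficients."""
    n = len(dense)
    res = [0] * n
    for sign, idxs in ((1, pos), (-1, neg)):
        for i in idxs:                        # outer loop over nonzeros
            for j, c in enumerate(dense):     # inner loop: shift-and-add
                k = i + j
                if k < n:
                    res[k] = (res[k] + sign * c) % q
                else:                         # wrap-around uses x^n = -1
                    res[k - n] = (res[k - n] - sign * c) % q
    return res

# (1 + x - x^2) * (3 + 5x) mod (x^3 + 1) = 8 + 8x + 2x^2  (since -5x^3 = +5)
print(spm([0, 1], [2], [3, 5, 0]))  # [8, 8, 2]
```

Because the sparse factor's coefficients are only ±1, the hardware needs no multipliers at all, only additions, subtractions, and reductions mod 251.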

Secure and Verifiable Data Access Control Scheme With Policy Update and Computation Outsourcing for Edge Computing

Yue Guan, Songtao Guo, Pan Li and Yuanyuan Yang

Edge computing executes computing tasks on edge devices closer to the data source, which can effectively improve system response speed and reduce the risk of user data leakage. However, current data access control schemes usually focus on cloud computing and rarely on edge computing. Although attribute-based encryption (ABE) can realize flexible and reliable access control, its computing cost grows too high as access policy complexity increases. Therefore, combining computation outsourcing with dynamic policy updating, we propose a data access control scheme based on ciphertext-policy ABE (CP-ABE) for edge computing. We outsource part of the storage service and part of the decryption computation to edge nodes, effectively reducing the computing burden on users. When the data owner requires a new access policy, a policy update key is generated promptly and transmitted to the cloud service provider to update the access policy, avoiding the bandwidth consumption and leakage risk of transmitting the ciphertext back and forth. Finally, security analysis and experimental results verify the safety and effectiveness of our scheme.

Session Chair

Chunpeng Ge (Nanjing University of Aeronautics and Astronautics)

Session E2

AI and Distributed System Security

Conference
4:20 PM — 6:00 PM HKT
Local
Dec 2 Wed, 3:20 AM — 5:00 AM EST

Secure Door on Cloud: A Secure Data Transmission Scheme to Protect Kafka's Data

Hanyi Zhang, Liming Fang, Keyu Jiang, Weiting Zhang, Minghui Li and Lu Zhou

Apache Kafka, a high-throughput distributed message processing system, has been adopted by the majority of enterprises for its outstanding performance. Unlike common cloud-based access control architectures, Kafka service providers often need to build their systems on other enterprises' high-performance cloud platforms. However, since the cloud platform belongs to a third party, it is not necessarily reliable. It has been demonstrated that Kafka's data is stored in the cloud in plaintext, posing a serious risk of user privacy leakage. In this paper, we propose a secure fine-grained data transmission scheme called Secure Door on Cloud (SDoC) to protect Kafka's data from being leaked. SDoC is not only more secure than Kafka's built-in security mechanism, but can also effectively prevent a third-party cloud from stealing plaintext data. To evaluate the performance of SDoC, we simulate normal inter-entity communication and show that Kafka with SDoC integration has lower data transfer time overhead than Kafka with its built-in security mechanism enabled.

A Solution to Data Accessibility Across Heterogeneous Blockchains

Zhihui Wu, Yang Xiao, Enyuan Zhou, Qingqi Pei, and Quan Wang

Cross-heterogeneous blockchain interactions have been attracting much attention due to their applications in mutual access between depository blockchains and in cross-blockchain identity authentication, and trusted access across heterogeneous chains is gradually becoming a hot challenge. To ensure trusted cross-blockchain access, the majority of current works rely on on-chain notaries or the relay chain model. However, these methods have the following drawbacks: 1) on-chain notaries are more vulnerable to attacks due to their high degree of centralization, which causes off-chain users to lose trust and thus exacerbates the off-chain trust crisis; 2) although the relay model involves multiple parties in maintenance and supervision and enjoys more robust trust, its participant nodes are relatively fixed, so invalid nodes cannot be removed from consensus formation in a timely manner, progressively disrupting the connectivity of the relay across heterogeneous chains and eventually reducing the rate of trusted mutual access.
In this article, we propose a novel general framework for cross-heterogeneous blockchain communication based on a periodic committee rotation mechanism, supporting the exchange of diverse transactions across multiple heterogeneous blockchain systems. Connecting heterogeneous blockchains through committees provides more robust trust than the notary method. To eliminate the impact of downed nodes in a timely manner, we periodically reorganize the committee and give priority to replacing downed nodes, ensuring the reliability of the system. In addition, a message-oriented verification mechanism is designed to improve the rate of trusted mutual access across heterogeneous chains. We have built a prototype of the scheme and conducted simulation experiments on mainstream blockchains for message exchange across heterogeneous chains. The results show that our solution performs well in both inter-chain access rate and system stability.
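The rotation step can be sketched as follows (our simplification with hypothetical node names, not the paper's actual protocol): each epoch, downed members are evicted first, then the committee is refilled from the live candidate pool so that relay connectivity is restored promptly.

```python
import random

random.seed(1)

def rotate(committee, candidates, alive, size):
    """One rotation epoch: evict downed members, refill from live candidates."""
    survivors = [n for n in committee if alive(n)]
    pool = [n for n in candidates if alive(n) and n not in survivors]
    random.shuffle(pool)                     # randomized seat assignment
    return survivors + pool[: size - len(survivors)]

candidates = [f"node{i}" for i in range(10)]
down = {"node1", "node4"}                    # nodes that failed this epoch
committee = ["node0", "node1", "node2", "node3"]

committee = rotate(committee, candidates, lambda n: n not in down, size=4)
print(committee)  # node1 is evicted; a live candidate takes its seat
```

Prioritizing eviction over reshuffling keeps honest survivors in place, so the rotation repairs failures without discarding accumulated trust each epoch.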

PrivAG: Analyzing Attributed Graph Data with Local Differential Privacy

Zichun Liu, Liusheng Huang, Hongli Xu, Wei Yang and Shaowei Wang

Attributed graph data is powerful for describing relational information in various areas, such as social links across numerous web services and citation/reference relations in collaboration networks. Taking advantage of attributed graph data, service providers can model complex systems and capture diversified interactions to achieve better business performance. However, privacy concerns are a huge obstacle to collecting and analyzing users' attributed graph data.
Existing studies on protecting private graph data mainly focus on edge local differential privacy (LDP), which might be insufficient in some highly sensitive scenarios. In this paper, we present a novel privacy notion that is stronger than edge LDP, and investigate approaches to analyze attributed graphs under this notion. To neutralize the effect of excessively introduced noise, we propose PrivAG, a privacy-preserving framework that protects attributed graph data in the local setting while providing representative graph statistics. The effectiveness and efficiency of the PrivAG framework are validated through extensive experiments.
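The basic building block behind edge LDP is randomized response: each user perturbs their own bit (e.g., "does this edge exist?") before reporting, and the aggregator debiases the observed frequency. This is a generic sketch of that primitive, not PrivAG itself:

```python
import math, random

random.seed(42)

def perturb(bit, eps):
    """Warner's randomized response: report the true bit with probability
    p = e^eps / (e^eps + 1), otherwise flip it."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return bit if random.random() < p else 1 - bit

def estimate_frequency(reports, eps):
    """Debias the observed mean: E[report] = (2p - 1) * f + (1 - p)."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    mean = sum(reports) / len(reports)
    return (mean - (1 - p)) / (2 * p - 1)

eps = 1.0
true_bits = [1] * 3000 + [0] * 7000          # 30% of users hold the edge
reports = [perturb(b, eps) for b in true_bits]
print(round(estimate_frequency(reports, eps), 2))  # close to 0.3
```

Each individual report is deniable, yet the population statistic is recoverable; PrivAG's contribution is doing this under a notion stronger than plain edge LDP while keeping the added noise manageable.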

Exploring Data Correlation between Feature Pairs for Generating Constraint-based Adversarial Examples

Yunzhe Tian, Yingdi Wang, Endong Tong, Wenjia Niu, Liang Chang, Qi Alfred Chen, Gang Li and Jiqiang Liu

Adversarial examples (AEs), inputs modified slightly to cause a machine learning system to produce erroneous outputs, have been studied extensively in recent years. Unfortunately, the fine-grained data perturbation of an AE fails to preserve potential data correlations between feature pairs, so such AEs can easily be filtered out by configuring data correlations as basic filtering rules. In this paper, we propose an advanced, robust AE generation attack that evades such filtering while still causing misclassification. We first define four basic data correlations: strict linear constraint, approximate linear constraint, addition boundary constraint, and zero multiplication constraint. Then, by embedding multiple data correlations into one constraint matrix derived from Pearson analysis, our approach crafts perturbations as a Hadamard product of the constraint matrix and the sign of the gradient matrix, keeping data correlations consistent. Experimental results on an intrusion detection system (IDS) indicate that: 1) nearly all AEs from the original IFGSM are invalidated by filtering according to the basic data correlations; 2) with our method, AEs against a targeted DNN-based classifier achieve an attack success rate of 99%, with a transfer attack ability of 94% average success rate against other mainstream classifiers.
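The core crafting step, as we read it, masks the usual FGSM sign-of-gradient update with the constraint matrix, so features bound by a correlation are never perturbed independently (toy numbers, plain Python in place of the paper's framework):

```python
def sign(x):
    return (x > 0) - (x < 0)

def constrained_fgsm_step(x, grad, C, eps):
    """Perturbation = eps * (C ⊙ sign(grad)): a Hadamard product of the
    constraint matrix with the sign of the loss gradient."""
    return [[xi + eps * ci * sign(gi)
             for xi, gi, ci in zip(xr, gr, cr)]
            for xr, gr, cr in zip(x, grad, C)]

x    = [[0.2, 0.5, 0.9]]
grad = [[0.7, -1.3, 0.4]]
C    = [[1, 1, 0]]   # third feature is constraint-locked: leave it untouched
print(constrained_fgsm_step(x, grad, C, eps=0.1))  # third feature stays 0.9
```

Zeroing an entry of C is the simplest case; the paper's four correlation types determine how C is populated so that linear and multiplicative relations between feature pairs survive the perturbation.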

A Deep Learning Framework Supporting Model Ownership Protection and Traitor Tracing

Guowen Xu, Hongwei Li, Yuan Zhang, Xiaodong Lin, Robert H. Deng and Xuemin Shen

Cloud-based deep learning (DL) solutions have been widely used in applications ranging from image recognition to speech recognition. Meanwhile, as commercial software and services, such solutions have raised the need for intellectual property protection of the underlying DL models. Watermarking is the mainstream approach to this concern, primarily embedding pre-defined secrets in a model's training process. However, existing efforts almost exclusively focus on detecting whether a target model is pirated, without considering traitor tracing. In this paper, we present SecureMark DL, which enables a model owner to embed a unique fingerprint for every customer within the parameters of a DL model, extract and verify the fingerprint from a pirated model, and hence trace the rogue customer who illegally distributed the model for profit. We demonstrate that SecureMark DL is robust against various attacks, including fingerprint collusion and network transformation (e.g., model compression and model fine-tuning). Extensive experiments conducted on the MNIST and CIFAR10 datasets, with various types of deep neural networks, show the superiority of SecureMark DL in terms of training accuracy and robustness against various types of attacks.
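The embed-then-extract workflow can be illustrated with a toy parameter-nudging scheme (our illustration only, not SecureMark DL's actual embedding, which must also survive compression and fine-tuning): a per-customer bit string nudges a secret set of weights by ±δ, and the owner recovers the bits by comparing a suspect model against the original.

```python
DELTA = 1e-3   # nudge magnitude, small enough not to affect accuracy (assumed)

def embed(weights, positions, bits, delta=DELTA):
    """Encode each fingerprint bit as a +delta / -delta nudge at a secret index."""
    w = list(weights)
    for p, b in zip(positions, bits):
        w[p] += delta if b else -delta
    return w

def extract(weights, original, positions):
    """Recover bits by the sign of the deviation from the owner's original."""
    return [1 if weights[p] > original[p] else 0 for p in positions]

base = [0.12, -0.44, 0.05, 0.31, -0.09, 0.77]   # toy model parameters
positions = [0, 2, 4, 5]                        # secret embedding locations
fingerprint = [1, 0, 0, 1]                      # unique per customer

pirated = embed(base, positions, fingerprint)
print(extract(pirated, base, positions))  # [1, 0, 0, 1] -- traces the customer
```

A real scheme must additionally make the fingerprints collusion-resistant, since two customers could diff their models to locate and erase the embedding positions.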

Session Chair

Yang Xiao (Xidian University)
