Session Keynote-3

Keynote 3

Conference
9:30 AM — 10:15 AM HKT
Local
Dec 2 Wed, 8:30 PM — 9:15 PM EST

Towards the Acceleration of Enterprise AI

Hui Lei (VP, Futurewei Technologies)

As much as enterprises are eager to embrace AI to innovate products, transform business, reduce costs, and improve competitive advantages, they find it very difficult to productionize AI and realize its value. Despite a good number of AI pilot projects for evaluation purposes, only a small portion of those have turned into revenue-bearing production. Some industry analysts have pegged the enterprise adoption of AI at less than 20%, and the world is still far away from AI democratization. Surprisingly, difficulties with productionizing AI have little to do with core machine learning algorithms and techniques. Instead, they have a lot to do with the huge leap required from the development of machine learning prototypes in lab settings to the development of large-scale enterprise-grade AI systems. In order to adopt AI at scale and minimize time to value, enterprises must solve problems in a wide variety of areas including infrastructure, data, skills, trust, and operationalization. That, in turn, opens many opportunities for research innovations. In this talk, I will discuss the current state of enterprise AI and sample the associated technical challenges. I will also present some research directions that will advance the state of the art and help unlock AI's potential for businesses and society.

Session Chair

Song Guo (The Hong Kong Polytechnic University)

Session Award

Best Paper Award

Conference
10:15 AM — 10:35 AM HKT
Local
Dec 2 Wed, 9:15 PM — 9:35 PM EST

Session Chair

Song Guo (The Hong Kong Polytechnic University)

Session F1

Data Processing

Conference
10:35 AM — 11:55 AM HKT
Local
Dec 2 Wed, 9:35 PM — 10:55 PM EST

A social link based private storage cloud

Michalis Konstantopoulos, Nikos Chondros and Mema Roussopoulos

In this paper, we present O^3, a social link based private storage cloud for decentralized collaboration. O^3 allows users to share and collaborate on collections of files (shared folders) with others in a decentralized manner, without the need for intermediaries (such as public cloud storage servers) to intervene. Thus, users can keep their working relationships (who they work with) and what they work on private from third parties. Using benchmarks and traces from real workloads, we experimentally evaluate O^3 and demonstrate that the system scales linearly when synchronizing increasing numbers of concurrent users, while performing on par with the ext4 non-version-tracking filesystem.

Enabling Generic Verifiable Aggregate Query on Blockchain Systems

Yanchao Zhu, Zhao Zhang, Cheqing Jin, and Aoying Zhou

Currently, users in a blockchain system must maintain all the data on the blockchain and query the data locally to ensure the integrity of the query results. However, because data is updated in an append-only way, the blockchain grows very large and imposes considerable maintenance costs on users. In this paper, we present an approach to support verifiable aggregate queries on blockchain systems that alleviates both storage and computing costs for users, while ensuring the integrity of the query results. We design an accumulator-based authenticated data structure (ADS) that supports verifiable multidimensional aggregate queries (i.e., aggregate queries with multiple selection predicates). The structure is built for each block, and on this basis verifiable multidimensional aggregate queries within a single block or across multiple blocks are supported. We further optimize performance by merging the ADSs of different blocks to reduce the verification time at the client side and the size of the verification object (VO). Extensive experiments demonstrate the effectiveness and efficiency of our proposed approach.

SrSpark: Skew-resilient Spark based on Adaptive Parallel Processing

Yijie Shen, Jin Xiong and Dejun Jiang

MapReduce-based SQL processing systems, e.g., Hive and Spark SQL, are widely used for big data analytic applications due to automatic parallel processing on large-scale machines. They provide high processing performance when loads are balanced across the machines. However, skewed loads are not rare in real applications. Although many efforts have been made to address the skew issue in MapReduce-based systems, they can neither fully exploit all available computing resources nor handle skews in SQL processing. Moreover, none of them can expedite the processing of skewed partitions in case of failures. In this paper, we present SrSpark, a MapReduce-based SQL processing system that can make full use of all computing resources for both non-skewed and skewed loads. To achieve this goal, SrSpark introduces fine-grained processing and work stealing into the MapReduce framework. More specifically, SrSpark is implemented based on Spark SQL. In SrSpark, partitions are further divided into sub-partitions and processed at sub-partition granularity. Moreover, SrSpark adaptively uses both intra-node and inter-node parallel processing for skewed loads according to the computing resources available in real time. Such adaptive parallel processing increases the degree of parallelism and reduces the interaction overheads among the cooperative worker threads. In addition, SrSpark periodically checkpoints the processing results of sub-partitions to ensure fast recovery from failures during skewed partition processing. Our experimental results show that for skewed loads, SrSpark outperforms Spark SQL by up to 3.5x, and by 2.2x on average, while the performance overhead is only about 4% under non-skewed loads.
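
The benefit of processing at sub-partition granularity can be illustrated with a toy simulation. The sketch below is not SrSpark's implementation: the function names, the cost model (one time unit per record), and the single shared task queue, which stands in for work stealing, are all assumptions made for illustration.

```python
import heapq

def split_into_subpartitions(partition_sizes, sub_size):
    """Cut each partition into chunks of at most sub_size records."""
    if sub_size <= 0:
        raise ValueError("sub_size must be positive")
    tasks = []
    for pid, size in enumerate(partition_sizes):
        offset = 0
        while offset < size:
            chunk = min(sub_size, size - offset)
            tasks.append((pid, chunk))
            offset += chunk
    return tasks

def makespan(tasks, workers):
    """Simulate idle workers pulling the next pending task.

    Each task costs as many time units as it has records (an assumption).
    Returns the time at which the last worker finishes.
    """
    free_at = [0] * workers              # time at which each worker is idle
    heapq.heapify(free_at)
    for _pid, cost in tasks:
        start = heapq.heappop(free_at)   # earliest idle worker takes the task
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# One skewed partition (100 records) and three small ones, on 4 workers.
sizes = [100, 10, 10, 10]
coarse = [(pid, size) for pid, size in enumerate(sizes)]
fine = split_into_subpartitions(sizes, sub_size=10)
print(makespan(coarse, 4), makespan(fine, 4))
```

With these hypothetical inputs, whole-partition scheduling finishes at time 100 because one worker carries the entire skewed partition, while sub-partition scheduling finishes at time 40 because the other workers share that load.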

Optimizing Multi-way Theta Join for Data Skew in Sub-Second Stream Computing

Xiaopeng Fan, Xinchun Liu, Yang Wang, Youjun Wang, and Jing Li

In sub-second stream computing, the answer to a complex query usually depends on aggregation or join operations on streams, especially multi-way theta join. Some attribute keys are not distributed uniformly, which is called the data intrinsic skew problem; examples include taxi license plates in GPS trajectories and transaction records, or stock codes in stock quotes and investment portfolios. In this paper, we define the concept of key redundancy for a single stream as the degree of data intrinsic skew, and joint key redundancy for multi-way streams. We present an execution model for multi-way stream theta joins with a fine-grained cost model to evaluate its performance. We propose a solution named Group Join (GroJoin) to make use of key redundancy during transmission and execution in a cluster. GroJoin adapts to data intrinsic skew in that it depends on a grouping condition we identify, i.e., the selectivity of theta join results should be smaller than 25%. Experiments use our MS-Generator to produce multi-way streams, and the simulation results show that GroJoin can reduce transmission overheads by up to 45% with different key redundancies and value-key proportionality coefficients, and reduce query delay by up to 70% with different key distributions. We further implement GroJoin for multi-way stream theta join on Spark Streaming. The experimental results demonstrate that join latency is reduced by about 40% to 50% after our optimization, at a very small computation cost.

Session Chair

Weigang Wu (Sun Yat-sen University)

Session F2

Resource and Data Management

Conference
10:35 AM — 11:55 AM HKT
Local
Dec 2 Wed, 9:35 PM — 10:55 PM EST

OOOPS: An Innovative Tool for IO Workload Management on Supercomputers

Lei Huang and Si Liu

Modern supercomputer applications demand high-performance storage resources in addition to fast computing resources. However, these storage resources, especially parallel shared filesystems, have become the Achilles' heel of many powerful supercomputers. Because there is no mechanism for provisioning IO resources on the file server side, a single user's IO-intensive work running on a small number of nodes can overload the metadata server and result in global filesystem performance degradation and even unresponsiveness. To tackle this issue, we developed an innovative tool, Optimal Overloaded IO Protection System (OOOPS). This tool is designed to control the IO workload from the application side. Supercomputer administrators can easily set the maximum number of open() and stat() function calls allowed per second. OOOPS can automatically detect and throttle intensive IO workloads to protect parallel shared filesystems. It also allows supercomputer administrators to dynamically adjust how much metadata throughput a job can use while the job runs, without interruption.
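
The idea of capping metadata calls per second can be sketched at the application level. The following is a minimal illustration only, assuming a simple fixed one-second window; the class and function names are hypothetical, and the abstract does not describe how OOOPS itself intercepts or throttles calls.

```python
import os
import time

class CallRateLimiter:
    """Allow at most max_calls calls per one-second window (fixed-window sketch)."""

    def __init__(self, max_calls_per_sec, clock=time.monotonic, sleep=time.sleep):
        if max_calls_per_sec <= 0:
            raise ValueError("max_calls_per_sec must be positive")
        self.max_calls = max_calls_per_sec   # may be changed while running
        self._clock = clock
        self._sleep = sleep
        self._window_start = clock()
        self._count = 0

    def acquire(self):
        """Block until one more call fits into the current budget."""
        now = self._clock()
        if now - self._window_start >= 1.0:
            # The previous window has expired: start a fresh one.
            self._window_start = now
            self._count = 0
        if self._count >= self.max_calls:
            # Budget exhausted: wait for the rest of the window, then reset.
            self._sleep(1.0 - (now - self._window_start))
            self._window_start = self._clock()
            self._count = 0
        self._count += 1

def throttled_stat(path, limiter):
    """Call os.stat() only after the limiter grants a slot."""
    limiter.acquire()
    return os.stat(path)
```

Because `max_calls` is a plain attribute, the limit can be raised or lowered while a job is running, which mirrors the dynamic adjustment mentioned in the abstract. The injectable `clock` and `sleep` parameters exist only to make the sketch testable without real delays.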

URFS: A User-space Raw File System based on NVMe SSD

Yaofeng Tu, Yinjun Han, Zhenghua Chen, Zhengguang Chen and Bing Chen

NVMe (Non-Volatile Memory Express) is a protocol designed specifically for SSDs (Solid State Drives), and it has significantly improved the performance of SSD storage devices. However, the traditional kernel-space IO path hinders the performance of NVMe SSD devices. In this paper, a user-space raw file system (URFS) based on NVMe SSD is proposed. Through the design of a user-space multi-process shared cache, multiple applications can share access to the SSD, which reduces the number of SSD accesses; an NVMe-oriented log-free data layout and multi-granularity IO queue elastic separation technology are used to improve system performance and throughput. Experiments show that, compared to traditional file systems, URFS improves performance by more than 23% in CDN (Content Delivery Network) scenarios, and by even more in small-file and read-intensive scenarios.

DyRAC: Cost-aware Resource Assignment and Provider Selection for Dynamic Cloud Workloads

Yannis Sfakianakis, Manolis Marazakis and Angelos Bilas

A primary concern for cloud users is how to minimize the total cost of ownership of cloud services. This is not trivial to achieve due to workload dynamics. Users need to select the number, size, type of VMs, and the provider to host their services based on available offerings. To avoid the complexity of re-configuring a cloud service, related work commonly approaches cost minimization as a packing problem that minimizes the resources allocated to services. However, this approach does not consider two problem dimensions that can further reduce cost: (1) provider selection and (2) VM sizing. In this paper, we explore a more direct approach to cost minimization by adjusting the type, number, size of VM instances, and the provider of a cloud service (i.e. a service deployment) at runtime. Our goal is to identify the limits in service cost reduction by online re-deployment of cloud services. For this purpose, we design DyRAC, an adaptive resource assignment mechanism for cloud services that, given the resource demands of a cloud service, estimates the most cost-efficient deployment. Our evaluation implements four different resource assignment policies to provide insight into how our approach works, using VM configurations of actual offerings from main providers (AWS, GCP, Azure). Our experiments show that DyRAC reduces cost by up to 33% compared to typical strategies.

WMAlloc: A Wear-Leveling-Aware Multi-Grained Allocator for Persistent Memory File Systems

Shun Nie, Chaoshu Yang, Runyu Zhang, Wenbin Wang, Duo Liu and Xianzhang Chen

Emerging Persistent Memories (PMs) promise to revolutionize storage systems by providing fast, persistent data access on the memory bus. Therefore, persistent memory file systems are developed to achieve high performance by exploiting the advanced features of PMs. Unfortunately, PMs have limited write endurance. Furthermore, the existing space management strategies of persistent memory file systems usually ignore this problem, which can cause write operations to concentrate on a few PM cells. The unbalanced writes can then wear out the underlying PM quickly, which seriously damages the data reliability of the file system. Existing wear-leveling-aware space management techniques, in turn, mainly focus on improving the wear-leveling accuracy of PMs rather than reducing the overhead, which can seriously reduce the performance of persistent memory file systems. In this paper, we propose a Wear-Leveling-Aware Multi-Grained Allocator, called WMAlloc, to achieve wear-leveling of PM while improving the performance of persistent memory file systems. WMAlloc adopts multiple heap trees to manage the unused space of PM, and each heap tree represents an allocation granularity. For each allocation, WMAlloc then allocates the required less-worn blocks from the corresponding heap tree. We implement the proposed WMAlloc in the Linux kernel based on NOVA, a typical persistent memory file system. Compared with DWARM, the state-of-the-art wear-leveling-aware space management technique, experimental results show that WMAlloc achieves 1.52× the PM lifetime and a 1.44× performance improvement on average.
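
The "one heap per allocation granularity, least-worn block first" idea can be sketched in a few lines. This is an illustration under stated assumptions, not WMAlloc's kernel implementation: the class name, the per-block wear counter, and the block identifiers are hypothetical.

```python
import heapq

class WearAwareAllocator:
    """Keep one min-heap of free blocks per granularity, ordered by wear count."""

    def __init__(self, free_blocks):
        # free_blocks maps granularity (bytes) -> list of (wear_count, block_id).
        self._heaps = {}
        for granularity, blocks in free_blocks.items():
            heap = list(blocks)
            heapq.heapify(heap)
            self._heaps[granularity] = heap

    def allocate(self, granularity):
        """Return (block_id, wear_count) of the least-worn free block."""
        heap = self._heaps.get(granularity)
        if not heap:
            raise MemoryError(f"no free block of size {granularity}")
        wear, block_id = heapq.heappop(heap)
        return block_id, wear

    def free(self, granularity, block_id, wear):
        """Return a block to its heap together with its updated wear count."""
        heapq.heappush(self._heaps.setdefault(granularity, []), (wear, block_id))

# Usage: the block written most often is handed out last.
alloc = WearAwareAllocator({4096: [(5, "a"), (1, "b"), (3, "c")]})
block, wear = alloc.allocate(4096)      # least-worn block "b"
alloc.free(4096, block, wear + 9)       # pretend it absorbed 9 more writes
```

Popping from a binary heap costs O(log n) per allocation, which is one plausible reason to prefer a heap over scanning all free blocks for the minimum wear count.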

Session Chair

Huawei Huang (Sun Yat-sen University)

Session F3

Secure and Reliable Systems

Conference
10:35 AM — 11:55 AM HKT
Local
Dec 2 Wed, 9:35 PM — 10:55 PM EST

Optimizing Complex OpenCL Code for FPGA: A Case Study on Finite Automata Traversal

Marziyeh Nourian, Mostafa Eghbali Zarch and Michela Becchi

While FPGAs have been traditionally considered hard to program, recently there have been efforts aimed to allow the use of high-level programming models and libraries intended for multi-core CPUs and GPUs to program FPGAs. For example, both Intel and Xilinx are now providing toolchains to deploy OpenCL code onto FPGA. However, because the nature of the parallelism offered by GPU and FPGA devices is fundamentally different, OpenCL code optimized for GPU can prove very inefficient on FPGA, in terms of both performance and hardware resource utilization.
This paper explores this problem on finite automata traversal. In particular, we consider an OpenCL NFA traversal kernel optimized for GPU but exhibiting FPGA-friendly characteristics, namely: limited memory requirements, lack of synchronization, and SIMD execution. We explore a set of structural code changes and custom and best-practice optimizations to retarget this code to FPGA. We showcase the effect of these optimizations on an Intel Stratix V FPGA board using various NFA topologies from different application domains. Our evaluation shows that, while the resource requirements of the original code exceed the capacity of the FPGA in use, our optimizations lead to significant resource savings and allow the transformed code to fit the FPGA for all considered NFA topologies. In addition, our optimizations lead to speedups of up to 4x over an already optimized code variant designed to fit the NFA traversal kernel on FPGA. Some of the proposed optimizations can be generalized to other applications and introduced in OpenCL-to-FPGA compilers.

FastCredit: Expediting Credit-based Proactive Transports in Datacenters

Dezun Dong, Shan Huang, Zejia Zhou, Wenxiang Yang and Hanyi Shi

Recent proposals have leveraged emerging credit-based proactive transports to achieve high-throughput, low-latency datacenter network transport. In particular, transports that employ hop-by-hop credits have the merits of fast convergence, low buffer occupancy, and strong congestion avoidance. However, they treat long flows and latency-sensitive short flows equally, which increases both the transmission latency of short flows and the average flow completion time. Although flow scheduling mechanisms have been studied extensively to accelerate short-flow transmission, they are hard to apply directly to credit-based transports. The root cause is that most traditional flow scheduling mechanisms work on long queues containing flows of various sizes, while credit-based proactive transports maintain extremely short, bounded queues that are near zero.

Based on this observation, this paper makes the first attempt to accelerate short-flow scheduling in credit-based proactive transports and proposes FastCredit. FastCredit can be used as a general building block to expedite short flows in credit-based proactive transports. In FastCredit, we schedule credit transmission at both receivers and switches to indirectly perform flow scheduling, and develop a mechanism to mitigate credit waste and improve network goodput. Compared to the state-of-the-art credit-based transport protocol, FastCredit reduces average flow completion time to 0.78x and reduces short-flow transmission latency to 0.51x in realistic workloads. In particular, FastCredit reduces average flow completion time to 0.76x under incast circumstances and to 0.62x in many-to-one traffic mode. Furthermore, FastCredit still maintains the advantages of short queues and high throughput.

Proactive Failure Recovery for Stateful NFV

Zhenyi Huang and Huawei Huang

Network Function Virtualization (NFV) technology is viewed as a significant component of both fifth-generation (5G) communication networks and edge computing. In this paper, by reviewing the state-of-the-art work on applying NFV to edge computing, we identify an urgent research challenge: providing a proactive failure recovery mechanism for stateful NFV. To realize such proactive failure recovery, we propose a prediction-based algorithm for redeploying stateful NFV instances in real time when network failures occur. The proposed algorithm is based on a relaxation-and-rounding technique. Its theoretical performance guarantee is also analyzed rigorously. Simulation results show that the proposed failure recovery algorithm significantly outperforms reactive baselines in terms of redeployment latency.

TEEp: Supporting Secure Parallel Processing in ARM TrustZone

Zinan Li, Wenhao Li, Yubin Xia and Binyu Zang

Machine learning applications are becoming prevalent on various computing platforms, including cloud servers, smartphones, IoT devices, etc. For these applications, security is one of the most urgent requirements. While trusted execution environments (TEEs) like ARM TrustZone have been widely used to protect critical procedures including fingerprint authentication and mobile payment, state-of-the-art implementations of TEE OSes lack support for multi-threading and are not suitable for computing-intensive workloads. This is because current TEE OSes are usually designed for hosting security-critical tasks, which are typically small and non-computing-intensive. Thus, most TEE OSes do not support multi-threading in order to minimize the size of the trusted computing base (TCB). In this paper, we propose TEEp, a system that enables multi-threading in the TEE without weakening security, and supports running existing multi-threaded applications directly in the TEE. Our design includes a novel multi-threading mechanism based on the cooperation between the TEE OS and the host OS, without trusting the host OS. We implement our system based on OP-TEE and port it to two platforms: a HiKey 970 development board as the mobile platform, and a Huawei Hi1610 ARM server as the server platform. We run TensorFlow Lite on the development board and TensorFlow on the server for performance evaluation in the TEE. The results show that our system can improve the throughput of TensorFlow Lite on 5 models to 3.2x when 4 cores are available, with 13.5% overhead compared with Linux on average.

Session Chair

Yu Huang (Nanjing University)

Session Lunch-Break-2

Virtual Lunch Break

Conference
11:55 AM — 1:30 PM HKT
Local
Dec 2 Wed, 10:55 PM — 12:30 AM EST

Session Chair

N/A

Session G1

Scheduling and Resource Management

Conference
1:30 PM — 2:50 PM HKT
Local
Dec 3 Thu, 12:30 AM — 1:50 AM EST

Compressive Sensing based Predictive Online Scheduling with Task Colocation in Cloud Data Center

Yunhin Chan, Ke Luo and Xu Chen

With the growing size of the cloud data center, the high scheduling efficiency over massive-scale cloud servers is hard to achieve, particularly when the scheduler requires the full real-time cloud resource information for decision making. Moreover, most data centers only run latency-critical online services, resulting in low resource utilization. To solve these problems, we propose a Compressive Sensing based Predictive Online Scheduling (CSPOS) algorithm. To mitigate the bottleneck of transferring massive resource information of all cloud servers to the scheduler, we propose to transfer sampled data from a small subset of servers to the scheduler and recover the full cloud resource information by compressive sensing. We then propose a predictive online learning algorithm that efficiently colocates the online services and batch jobs, in order to boost the resource utilization of the data center. Our experiments show that the CSPOS model achieves outstanding scheduling efficiency under various settings and is able to greatly increase the resource usage of a data center. We also illustrate that the running time of the CSPOS model is very small and has negligible effects on the scheduling system.

Non-Technical Losses Detection in Smart Grids: An Ensemble Data-Driven Approach

Yufeng Xing, Lei Guo, Zongchao Xie, Lei Cui, Longxiang Gao and Shui Yu

Non-technical loss (NTL) detection plays a crucial role in protecting the security of smart grids. Employing massive energy consumption data and advanced artificial intelligence (AI) techniques is helpful for NTL detection. However, there are concerns regarding the effectiveness of existing AI-based detectors against covert attack methods. In particular, tampered metering data with normal consumption patterns may result in a low detection rate. Motivated by this, we propose a hybrid data-driven detection framework. Specifically, we introduce a wide & deep convolutional neural network (CNN) model to capture the global and periodic features of consumption data. We also leverage the maximal information coefficient algorithm to analyze and detect covert abnormal measurements. Our extensive experiments under different attack scenarios demonstrate the effectiveness of the proposed method.

Virtual Machine Consolidation for NUMA Systems: A Hybrid Heuristic Grey Wolf Approach

Kangli Hu, Weiwei Lin, Tiansheng Huang, Keqin Li and Like Ma

Virtual machine consolidation is known as a powerful means to reduce the number of active physical machines (PMs) and thereby save energy in data centers. Although consolidation is widely studied for non-NUMA systems, we could find only a few studies targeting NUMA systems. However, virtual machine (VM) deployment on NUMA systems is quite different from that on non-NUMA systems. More specifically, consolidating VMs on NUMA systems requires deciding both the target physical machines and the NUMA architectures that host the VMs, and more complicated constraints arising from the real usage of NUMA systems need to be considered. Motivated by these challenges, in this paper we formally derive a system model according to the real business model of NUMA systems, based on which we propose HHGWA, a hybrid heuristic swarm intelligence optimization algorithm, as an efficient solution. For evaluation, we conduct extensive simulations that integrate real VM and PM information, and the results indicate the superior performance of our proposed algorithm.

A Novel Classification Model to Predict Batch Job Failures in Co-located Cloud

Yurui Li, Weiwei Lin, Keqin Li, James Z. Wang, Fagui Liu and Jie Liu

Nowadays, cloud co-location is often used in data centers to improve the utilization of computing resources. However, batch jobs in a Co-location Datacenter (CLD) are vulnerable to failures due to the competition for limited resources with online service jobs. Such failed batch jobs are rescheduled and fail repeatedly, resulting in wasted computing resources and instability of the computing clusters. Therefore, we propose a method to accurately predict the potential failures of batch jobs in a CLD. The core of the proposed method is STLF (SMOTE Tomek and LightGBM [5] Framework), which is divided into three parts. First, we use the co-feature extraction method to generate the Co-located Feature Dataset (CLFD). Then SMOTE Tomek is used to oversample the CLFD to ensure that the classifier can learn more minority features. Finally, we use the LightGBM classifier to predict batch job failures. The performance experiments conducted on the Ali Trace 2018 dataset show that our proposed STLF significantly outperforms the existing popular classifiers in terms of the ROC curve, the area under the ROC curve (AUC), precision, and recall.

Session Chair

Zaipeng Xie (Hohai University)

Session G2

Distributed System Design and Implementation

Conference
1:30 PM — 2:50 PM HKT
Local
Dec 3 Thu, 12:30 AM — 1:50 AM EST

ABC: An Auction-Based Blockchain Consensus-Incentive Mechanism

Zhengpeng Ai, Yuan Liu and Xingwei Wang

The rapid development of blockchain technology and its various applications has attracted huge attention in the last five years. The consensus mechanism and the incentive mechanism are the backbone of a blockchain network. The consensus mechanism plays a crucial role in sustaining network security, integrity, and efficiency. The incentive mechanism motivates the distributed nodes to "mine" so as to participate in the consensus mechanism. Existing mechanisms suffer from fairness and justice issues. In this paper, from the perspective of mechanism design, we propose a consensus-incentive mechanism that applies continuous double auction theory, abbreviated as the ABC mechanism. Our mechanism consists of four stages: the initiation stage, auction stage, completion stage, and confirmation stage. The auction model in use is the continuous double auction, which ensures that transactions are stored in a real-time manner. Through extensive experimental evaluations, our mechanism is shown to improve the fairness and justice of the blockchain network.
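
A generic continuous double auction, in which each incoming order is matched immediately if it crosses the best resting order on the other side, can be sketched as follows. The abstract does not specify how bids and asks map onto blockchain participants, so the trader labels, the unit-quantity orders, and the rule that a trade executes at the resting order's price are assumptions for illustration only.

```python
import heapq
from itertools import count

class ContinuousDoubleAuction:
    """Unit-quantity order book with price-time priority."""

    def __init__(self):
        self._bids = []        # max-heap on price, stored as (-price, seq, trader)
        self._asks = []        # min-heap on price, stored as (price, seq, trader)
        self._seq = count()    # arrival counter used as a tie-breaker

    def submit(self, side, price, trader):
        """Match the order at once if possible, else rest it in the book.

        Returns (buyer, seller, trade_price) on a match, otherwise None.
        """
        if side == "bid":
            if self._asks and self._asks[0][0] <= price:
                ask_price, _, seller = heapq.heappop(self._asks)
                return (trader, seller, ask_price)
            heapq.heappush(self._bids, (-price, next(self._seq), trader))
            return None
        if side == "ask":
            if self._bids and -self._bids[0][0] >= price:
                neg_price, _, buyer = heapq.heappop(self._bids)
                return (buyer, trader, -neg_price)
            heapq.heappush(self._asks, (price, next(self._seq), trader))
            return None
        raise ValueError("side must be 'bid' or 'ask'")

# Usage with hypothetical participants.
cda = ContinuousDoubleAuction()
cda.submit("ask", 10, "seller-1")
cda.submit("ask", 8, "seller-2")
trade = cda.submit("bid", 9, "buyer-1")   # crosses the lowest ask (8)
```

Unlike a periodic call auction, nothing waits for a clearing round here: an order either trades on arrival or rests until a matching counter-order appears, which is the property the abstract relies on for storing transactions in real time.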

A Trustworthy Blockchain-based Decentralised Resource Management System in the Cloud

Zhiming Zhao, Chunming Rong and Martin Gilje Jaatun

Quality Critical Decentralised Applications (QCDApp) have high requirements for system performance and service quality, involve heterogeneous infrastructures (Clouds, Fogs, Edges and IoT), and rely on the trustworthy collaborations among participants of data sources and infrastructure providers to deliver their business value. The development of the QCDApp has to tackle the low-performance challenge of the current blockchain technologies due to the low collaboration efficiency among distributed peers for consensus. On the other hand, the resilience of the Cloud has enabled significant advances in software-defined storage, networking, infrastructure, and every technology; however, those rich programmabilities of infrastructure (in particular, the advances of new hardware accelerators in the infrastructures) can still not be effectively utilised for QCDApp due to lack of suitable architecture and programming model.

DCVP: Distributed Collaborative Video Stream Processing in Edge Computing

Shijing Yuan, Jie Li, Chentao Wu, Yusheng Ji and Yongbing Zhang

In edge computing, computation offloading of video stream tasks and collaborative processing among edge nodes are a huge challenge. Previous research mainly focuses on the selection of computing modes and resource allocation, but does not jointly consider computation offloading and collaborative processing by edge node groups. To jointly tackle these issues, we propose an innovative distributed collaborative video stream processing framework for edge computing (DCVP), in which video tasks are assigned to mobile edge computing (MEC) nodes or edge groups based on the offloading decision. First, we design a method for group formation, which matches video subtasks to appropriate edge groups. In addition, we present two offloading modes for video streaming tasks, i.e., offloading to MEC nodes or to edge groups, to handle computationally intensive video tasks. Furthermore, we formulate the joint optimization of the offloading decision and the collaborative processing of video subtasks as a distributed optimization problem. Finally, we employ an alternating direction method of multipliers (ADMM)-based algorithm to solve the problem. Simulation results under multiple parameters show that the proposed schemes outperform other typical schemes.
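
For readers unfamiliar with ADMM, the following sketch shows the standard consensus form on a toy problem: each node i holds a private target a_i and the nodes must agree on a single value z minimizing the sum of 0.5 * (x_i - a_i)^2. The objective is an assumption chosen because its local step has a closed form; it is not the DCVP offloading formulation, which the abstract does not give.

```python
def consensus_admm(targets, rho=1.0, iterations=100):
    """Consensus ADMM for min sum_i 0.5 * (x_i - a_i)^2 subject to x_i == z."""
    if not targets:
        raise ValueError("targets must not be empty")
    n = len(targets)
    x = [0.0] * n      # local variables, one per node
    u = [0.0] * n      # scaled dual variables, one per node
    z = 0.0            # shared consensus variable
    for _ in range(iterations):
        # Local step: each node minimizes its own cost plus the penalty term.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average the local estimates shifted by the duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each node's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x

# Three nodes with private targets; the optimum of this toy problem is their mean.
z_star, x_star = consensus_admm([1.0, 2.0, 6.0])
```

Only the local step touches a node's private data, and only x_i + u_i needs to be shared for the averaging step, which is why ADMM is a common choice when an optimization must be split across distributed nodes.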

Efficient Post-quantum Identity-based Encryption with Equality Test

Willy Susilo, Dung Hoang Duong and Huy Quoc Le

Public key encryption with equality test (PKEET) enables testing whether two ciphertexts encrypt the same message. Identity-based encryption with equality test (IBEET) simplifies the certificate management of PKEET, which leads to many potential applications, such as in smart cities or Wireless Body Area Networks. Lee et al. (ePrint 2016) proposed a generic construction of an IBEET scheme in the standard model utilising a 3-level hierarchical IBE together with a one-time signature scheme, which can be instantiated in the lattice setting. Duong et al. (ProvSec 2019) proposed the first direct construction of IBEET in the standard model from lattices. However, their scheme achieves CPA security only. In this paper, we improve the construction of Duong et al. by proposing an IBEET scheme in the standard model that achieves CCA2 security with smaller ciphertext and public key sizes.

Session Chair

Jia Liu (Nanjing University)

Session G3

Federated Learning and Deep Learning

Conference
1:30 PM — 2:30 PM HKT
Local
Dec 3 Thu, 12:30 AM — 1:30 AM EST

Robust Federated Learning Approach for Travel Mode Identification from Non-IID GPS Trajectories

Yuanshao Zhu, Shuyu Zhang, Yi Liu, Dusit Niyato and James J.Q. Yu

GPS trajectories are one of the most significant data sources in intelligent transportation systems (ITS). A simple application is to use these data sources to help companies or organizations identify users' travel behavior. However, since GPS trajectories are directly related to users' private data (e.g., location), citizens are unwilling to share their private information with third parties. How to identify travel modes while protecting the privacy of users is a significant issue. Fortunately, the Federated Learning (FL) framework can achieve privacy-preserving deep learning by allowing users to keep GPS data locally instead of sharing it. In this paper, we propose a Robust Federated Learning-based Travel Mode Identification System to identify travel modes without compromising privacy. Specifically, we design an attention-augmented model architecture and leverage robust FL to achieve privacy-preserving travel mode identification without accessing raw GPS data from the users. Compared to existing models, we are able to achieve more accurate identification results than the centralized model. Furthermore, considering the problem of non-Independent and Identically Distributed (non-IID) GPS data in the real world, we develop a secure data sharing strategy to adjust the distribution of local data for each user, so that the proposed model with non-IID data can achieve accuracy close to that obtained with IID data. Extensive experimental studies on a real-world dataset demonstrate that the proposed model can achieve accurate identification without compromising privacy while being robust to real-world non-IID data.

Deep Spatio-Temporal Attention Model for Grain Storage Temperature Forecasting

Shanshan Duan, Weidong Yang, Xuyu Wang, Shiwen Mao and Yuan Zhang

Temperature is one of the major ecological factors that affect the safe storage of grain. In this paper, we propose a deep spatio-temporal attention model to predict stored grain temperature, which exploits the historical temperature data of stored grain and the meteorological data of the region. In the proposed model, we use the Sobel operator to extract the local spatial factors, and leverage the attention mechanism to obtain the global spatial factors of grain temperature data and temporal information. In addition, a convolutional neural network (CNN) is used to learn features of external meteorological factors. Finally, the spatial factors of the grain pile and the external meteorological factors are combined to predict future grain temperature using long short-term memory (LSTM) based encoder and decoder models. Experimental results show that the proposed model achieves higher prediction accuracy than traditional methods.

Proactive Content Caching for Internet-of-Vehicles based on Peer-to-Peer Federated Learning

Zhengxin Yu, Jia Hu, Geyong Min, Han Xu and Jed Mills

To cope with the increasing content requests from emerging vehicular applications, caching contents at edge nodes is imperative to reduce service latency and network traffic on the Internet-of-Vehicles (IoV). However, the inherent characteristics of IoV, including the high mobility of vehicles and restricted storage capability of edge nodes, cause many difficulties in the design of caching schemes. Driven by the recent advancements in machine learning, learning-based proactive caching schemes are able to accurately predict content popularity and improve cache efficiency, but they need to gather and analyse users' content retrieval history and personal data, leading to privacy concerns. To address the above challenge, we propose a new proactive caching scheme based on peer-to-peer federated deep learning, where the global prediction model is trained from data scattered at vehicles to mitigate the privacy risks. In our proposed scheme, a vehicle, instead of an edge node, acts as a parameter server to aggregate the updated global model from peers. A dual-weighted aggregation scheme is designed to achieve high global model accuracy. Moreover, to enhance the caching performance, a Collaborative Filtering based Variational AutoEncoder model is developed to predict the content popularity. The experimental results demonstrate that our proposed caching scheme largely outperforms typical baselines, such as Greedy and Most Recently Used caching.
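
The aggregation step in federated learning can be sketched as a weighted average of the peers' parameter vectors. The example below uses plain sample-count weighting, which is an assumption: the abstract does not specify the paper's dual-weighted scheme, so this is not that scheme. All names and numbers are hypothetical.

```python
def weighted_average(models, weights):
    """Average parameter vectors, weighting each peer by a non-negative weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(models[0])
    return [
        sum(w * m[i] for m, w in zip(models, weights)) / total
        for i in range(dim)
    ]


# Hypothetical local models from three vehicles and their local sample counts.
peer_models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sample_counts = [10, 30, 60]

# The vehicle acting as parameter server combines the peers' updates.
global_model = weighted_average(peer_models, sample_counts)
```

Only model parameters are exchanged in this sketch; the raw retrieval histories stay on each vehicle.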

Session Chair

Bolei Zhang (Nanjing University of Posts and Telecommunications)

Session Break-2

Virtual Break

Conference
2:50 PM — 3:00 PM HKT
Local
Dec 3 Thu, 1:50 AM — 2:00 AM EST

Session Chair

N/A

Session H1

Workshop — Big Data Systems

Conference
3:00 PM — 4:20 PM HKT
Local
Dec 3 Thu, 2:00 AM — 3:20 AM EST

How Fast Can We Insert? An Empirical Performance Evaluation of Apache Kafka

Guenter Hesse, Christoph Matthies and Matthias Uflacker

Message brokers see widespread adoption in modern IT landscapes, with Apache Kafka being one of the most employed platforms. These systems feature well-defined APIs for use and configuration and present flexible solutions for various data storage scenarios. Their ability to scale horizontally enables users to adapt to growing data volumes and changing environments. However, one of the main challenges concerning message brokers is the danger of them becoming a bottleneck within an IT architecture. To prevent this, knowledge about the amount of data a message broker using a specific configuration can handle needs to be available. In this paper, we propose a monitoring architecture for message brokers and similar Java Virtual Machine-based systems. We present a comprehensive performance analysis of the popular Apache Kafka platform using our approach. As part of the benchmark, we study selected data ingestion scenarios with respect to their maximum data ingestion rates. The results show that we can achieve an ingestion rate of about 420,000 messages/second on the used commodity hardware and with the developed data sender tool.

Performance Modeling and Tuning for DFT Calculations on Heterogeneous Architectures

Hadia Ahmed, David Williams-Young, Khaled Z. Ibrahim and Chao Yang

Tuning scientific code for heterogeneous computing architectures is a growing challenge. Not only do we need to tune the code for multiple architectures, but we also need to select or schedule computations on the most efficient compute variant. In this paper, we explore the tuning and performance modeling of one of the most time-consuming kernels in density functional theory calculations on systems with a multicore host CPU accelerated with GPUs. We show that the problem configuration dictates the choice of the most efficient compute engine. This choice can alternate between the host and the accelerator, especially while scaling. As such, a performance model to predict the execution time on the host CPU and GPU is essential to select the compute environment and to achieve optimal performance. We present a simple model that empirically carries out this task and can accurately steer the scheduling of computation.
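
One way to picture an empirical performance model that steers scheduling is sketched below: fit a straight line to measured run times per device, then pick the device with the lower predicted time for a given problem size. The linear form, the function names, and the timing numbers are all assumptions made up for illustration; they are not the paper's model or measurements.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def choose_device(models, size):
    """Return the device with the lowest predicted time, plus all predictions."""
    predictions = {name: a + b * size for name, (a, b) in models.items()}
    return min(predictions, key=predictions.get), predictions


# Made-up timings: the GPU pays a fixed launch/transfer cost but scales better.
sizes = [100, 200, 400, 800]
cpu_times = [1.0, 2.0, 4.0, 8.0]
gpu_times = [2.25, 2.5, 3.0, 4.0]

models = {
    "cpu": fit_line(sizes, cpu_times),
    "gpu": fit_line(sizes, gpu_times),
}
small_choice, _ = choose_device(models, 150)
large_choice, _ = choose_device(models, 1000)
```

With these synthetic numbers the preferred device switches as the problem grows, mirroring the abstract's point that the best compute engine depends on the problem configuration.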

Graph-based Approaches for the Interval Scheduling Problem

Panagiotis Oikonomou, Nikos Tziritas, Georgios Theodoropoulos, Maria Koziri, Thanasis Loukopoulos and Samee U. Khan

One of the fundamental problems encountered by large-scale computing systems, such as clusters and clouds, is to schedule a set of jobs submitted by the users. Each job is characterized by resource demands, as well as start and completion times. Each job must be scheduled to execute on a machine having the required capacity between the start and completion time (referred to as the interval) of the job. Each machine is defined by a parallelism parameter g that indicates the maximum number of jobs that can be processed by the machine in parallel. The above problem is referred to as the interval scheduling problem with bounded parallelism. The objective is to minimize the total busy time of all machines. The majority of the solutions proposed in the literature consider a homogeneous set of jobs and machines, which is a simplifying assumption because, in practice, heterogeneous jobs and machines are frequently encountered. In this article, we tackle the aforesaid problem with a set of heterogeneous jobs and machines. A major contribution of our work is that the problem is addressed in a novel way by combining a graph-based approach and a dynamic programming approach that is based on a variation of the bin packing problem. A greedy algorithm that employs only a graph-based approach is also proposed, with the aim of reducing the computational complexity. Experimental results show that the proposed algorithms can significantly reduce the cumulative busy interval over all machines compared with state-of-the-art algorithms proposed in the literature.
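
To make the objective concrete, here is a minimal first-fit baseline for the homogeneous version of the problem: every machine has the same parallelism g and jobs carry no resource demands. This is a textbook-style heuristic chosen for illustration, not the graph-based or dynamic programming algorithm proposed in the paper; the job list is invented.

```python
def max_overlap(intervals, start, end):
    """Peak number of the given intervals active at any instant in [start, end)."""
    events = []
    for s, e in intervals:
        lo, hi = max(s, start), min(e, end)
        if lo < hi:
            events.append((lo, 1))
            events.append((hi, -1))
    # At equal times, -1 sorts before +1, so touching intervals do not overlap.
    events.sort()
    peak = active = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


def busy_time(intervals):
    """Total length of the union of the intervals (time the machine is on)."""
    total, cur_s, cur_e = 0, None, None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                total += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        total += cur_e - cur_s
    return total


def first_fit(jobs, g):
    """Place each job on the first machine that stays within parallelism g."""
    machines = []
    for job in sorted(jobs):
        for m in machines:
            if max_overlap(m, job[0], job[1]) < g:
                m.append(job)
                break
        else:
            machines.append([job])
    return machines


jobs = [(0, 4), (1, 5), (2, 6), (5, 8)]  # hypothetical (start, completion) pairs
machines = first_fit(jobs, g=2)
total_busy = sum(busy_time(m) for m in machines)
```

First-fit is not optimal in general: for this input a different placement of the last job gives a smaller total busy time, which is the gap that more careful algorithms aim to close.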

MPI parallelization of NEUROiD models using docker swarm

Raghu Sesha Iyengar and Mohan Raghavan

NEURON, along with other systems simulators, is increasingly being used to simulate neural systems where the complexity demands massively parallel implementations. NEURON's ParallelContext allows parallelizing models using MPI. However, when using NEURON models in a docker container, this parallelization does not work out-of-the-box. We propose an architecture for MPI parallelization of NEURON models using docker swarm. We integrate this on our NEUROiD platform and obtain almost 16x improvement in simulation time on our cluster.

Session Chair

Yuanyuan Xu (Hohai University)

Session H2

Workshop — Edge Intelligence for Smart IoT Applications

Conference
3:00 PM — 5:20 PM HKT
Local
Dec 3 Thu, 2:00 AM — 4:20 AM EST

S-GAT: Accelerating Graph Attention Networks Inference on FPGA Platform with Shift Operation

Weian Yan, Weiqin Tong, and Xiaoli Zhi

Deep learning has been successful in many fields such as acoustics, image, and natural language processing. However, due to the unique characteristics of graphs, deep learning using universal graph data is not easy. Graph Attention Networks (GATs) show the best performance in multiple authoritative node classification benchmark tests (including transductive and inductive). The purpose of this research is to design and implement an FPGA-based accelerator called S-GAT for graph attention networks that achieves excellent performance on acceleration and energy efficiency without losing accuracy, and does not rely on DSPs and large amounts of on-chip memory. We design S-GAT with software and hardware co-optimization. Specifically, we use model compression and feature quantization to reduce the model size, and use shift addition units (SAUs) to convert multiplication into shift operations to further reduce the computation requirements. We integrate the above optimizations into a universal hardware pipeline for various structures of GATs. Finally, we evaluate our design on an Inspur F10A board with an Intel Arria 10 GX1150 and 16 GB DDR3 memory. Experimental results show that S-GAT can achieve a 7.34 times speedup over an Nvidia Tesla V100 and 593 times over a Xeon Gold 5115 CPU while maintaining accuracy, and 48 times and 2400 times better energy efficiency, respectively.
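
The idea of replacing multiplication with shifts can be sketched in software as follows: approximate each weight by a signed power of two, then multiply an integer activation by shifting. This is a generic illustration under that assumption; the function names are hypothetical, and the actual SAU design and quantization scheme of S-GAT are not given in the abstract.

```python
import math


def nearest_power_of_two_exponent(weight):
    """Return (sign, exponent) so that sign * 2**exponent approximates weight.

    Rounding is done in the log2 domain, which is one simple choice among several.
    """
    if weight == 0:
        return 0, 0
    sign = 1 if weight > 0 else -1
    exponent = round(math.log2(abs(weight)))
    return sign, exponent


def shift_multiply(x, sign, exponent):
    """Multiply integer x by sign * 2**exponent using only shifts and negation."""
    if sign == 0:
        return 0
    # Right shifts floor the result, so small activations can lose precision.
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return -shifted if sign < 0 else shifted


sign, exponent = nearest_power_of_two_exponent(3.7)   # approximated as +2**2
approx_product = shift_multiply(10, sign, exponent)   # stands in for 10 * 3.7
```

The approximation error (40 instead of 37 in this toy case) is the price paid for avoiding multipliers, which is why such schemes are usually combined with quantization-aware training or fine-tuning.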

Explainable Congestion Attack Prediction and Software-level Reinforcement in Intelligent Traffic Signal System

Xiaojin Wang, Yingxiao Xiang, Wenjia Niu, Endong Tong, and Jiqiang Liu

With connected vehicle (CV) technology, the next-generation transportation system is stepping into its implementation phase via the deployment of the Intelligent Traffic Signal System (I-SIG). Since the congestion attack was first discovered in the USDOT (U.S. Department of Transportation) sponsored I-SIG, deployed in three cities including New York, this realistic threat has opened a new security issue. In this work, from a machine learning perspective, we perform a systematic feature analysis on the congestion attack and its variations from the last vehicle of different traffic flow patterns. We first adopt the Tree-regularized Gated Recurrent Unit (TGRU) to make explainable congestion attack predictions, in which 32-dimension features are defined to characterize the traffic of an 8-phase intersection. We then develop corresponding software-level security reinforcement suggestions, which can be further expanded as an important work. In massive experiments based on real-world intersection settings, we eventually distill 384 samples of congestion attacks to train a TGRU-based attack prediction model, and achieve an average precision of 80%. We further discuss possible reinforcement defense methods according to our prediction model.

Dynamic-Static-based Spatiotemporal Multi-Graph Neural Networks for Passenger Flow Prediction

Jingyan Ma, Jingjing Gu, Qiang Zhou, Qiuhong Wang and Ming Sun

Various sensing and computing technologies have gradually outlined the future of the intelligent city. Passenger flow prediction for public transport has become an important task in Intelligent Transportation Systems (ITS), and is a prerequisite for traffic management and urban planning. Many deep learning methods exist for learning spatiotemporal features from highly non-linear and complex traffic flows. However, they only utilize temporal correlation and static spatial correlation, such as geographical distance, which is insufficient for mining dynamic spatial correlation. In this paper, we propose the Dynamic-Static-based Spatiotemporal Multi-Graph Neural Networks model (DSSTMG) for predicting traffic passenger flows, which can concurrently incorporate the temporal and multiple static and dynamic spatial correlations. Firstly, we exploit the multiple static spatial correlations, including adjacency relations, station functional zone similarity and geographical distance, with a multi-graph fusion convolution operator. Secondly, we exploit the dynamic spatial correlations by calculating the similarity between the flow patterns of stations over a period of time, and build the dynamic spatial attention. Moreover, we use time attention and an encoder-decoder architecture to capture temporal correlation. The experimental results on two real-world datasets show that the proposed DSSTMG outperforms state-of-the-art methods.
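
The dynamic spatial correlation described above, a similarity between station flow patterns over a time window, might be computed as in the sketch below. Cosine similarity is an assumption picked for illustration; the abstract does not say which similarity measure DSSTMG uses, and the station names and flow values are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flow vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(flows):
    """Pairwise similarity between every pair of stations."""
    names = list(flows)
    return {
        (u, v): cosine_similarity(flows[u], flows[v])
        for u in names
        for v in names
    }


# Hypothetical passenger counts per station over three consecutive time slots.
flows = {
    "A": [10, 20, 30],
    "B": [20, 40, 60],   # same rising pattern as A, at twice the volume
    "C": [30, 20, 10],   # opposite trend
}
dynamic_graph = similarity_matrix(flows)
```

Recomputing the matrix for each time window yields edge weights that change over time, in contrast to static graphs built from distance or adjacency.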

Incentive-driven Data Offloading and Caching Replacement Scheme in Opportunistic Mobile Networks

Tong Wu, Xuxun Liu, Deze Zeng, Huan Zhou and Shouzhi Xu

Offloading cellular traffic through Opportunistic Mobile Networks (OMNs) is an effective way to relieve the burden of cellular networks. Providing data offloading services requires a lot of resources, and because nodes in OMNs are selfish and rational, they are not willing to provide data offloading services for others without any compensation. Therefore, it is urgent to design an incentive mechanism to stimulate mobile nodes to participate in the data offloading process. In this paper, we propose a Reverse Auction-based Incentive Mechanism to stimulate mobile nodes in OMNs to provide data offloading services, and take cache management into consideration. We model the incentive-driven data offloading process as a non-linear integer programming problem, and then propose a Greedy Helper Selection Method (GHSM) and a Caching Replacement Scheme (CRS) to solve the problem. In addition, we propose an innovative payment rule based on the Vickrey-Clarke-Groves (VCG) model to ensure the individual rationality and authenticity of the proposed algorithm. Trace-driven simulation results show that the proposed algorithm can significantly reduce the cost of the Content Service Provider (CSP) in different scenarios.
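
For intuition about VCG-style payments in a reverse auction, consider the simplest case of buying one unit of service from a single helper: the lowest bidder wins and is paid the second-lowest bid. The sketch below shows only this single-winner special case; it is not the paper's payment rule, which covers multiple helpers and caching, and the node names and bids are hypothetical.

```python
def reverse_vickrey(bids):
    """Pick the cheapest seller and pay it the second-lowest asking price."""
    if len(bids) < 2:
        raise ValueError("need at least two bids to set a second-price payment")
    ranked = sorted(bids.items(), key=lambda item: item[1])
    winner = ranked[0][0]
    payment = ranked[1][1]
    return winner, payment


# Hypothetical asking prices from three mobile nodes for relaying a content item.
bids = {"node1": 5.0, "node2": 3.0, "node3": 4.0}
winner, payment = reverse_vickrey(bids)
```

Because the payment does not depend on the winner's own bid and is never below it, a node gains nothing by misreporting its cost and never sells at a loss, which are the two properties the abstract names.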

Adaptive DNN Partition in Edge Computing Environments

Weiwei Miao, Zeng Zeng, Lei Wei, Shihao Li, Chengling Jiang and Zhen Zhang

Deep Neural Networks (DNNs) have been applied widely, making remarkable achievements in a wide variety of research fields. As the accuracy requirements for inference results rise, the topology of DNNs tends to become more and more complex, evolving from chain topology to directed acyclic graph (DAG) topology, which leads to a huge amount of computation. For end devices that have limited computing resources, the delay of running DNN models independently may be intolerable. As a solution, edge computing can make comprehensive use of all available devices in the edge computing environment to run DNN inference tasks, so as to accelerate inference. In this case, how to split a DNN inference task into several small tasks and assign them to different edge devices is the central issue. This paper proposes a load-balancing algorithm to split DNNs with DAG topology adaptively according to the environment. Extensive experimental results show that the proposed adaptive algorithm can effectively accelerate inference.
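
A very simple load-balancing split of a DAG-shaped network could look like the sketch below: visit layers in topological order and give each one to the device that would finish it earliest given its current load. This greedy rule, the layer names, costs, and device speeds are assumptions for illustration only; the sketch ignores communication delay and is not the paper's adaptive algorithm.

```python
from collections import deque


def topological_order(graph):
    """Kahn's algorithm over a dict mapping each node to its successors."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    queue = deque(node for node in graph if indegree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for s in graph[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if len(order) != len(graph):
        raise ValueError("graph has a cycle")
    return order


def assign_layers(graph, cost, speeds):
    """Greedily place each layer on the device with the lowest resulting load."""
    load = {device: 0.0 for device in speeds}
    placement = {}
    for layer in topological_order(graph):
        device = min(load, key=lambda d: load[d] + cost[layer] / speeds[d])
        load[device] += cost[layer] / speeds[device]
        placement[layer] = device
    return placement, load


# Hypothetical DAG with two parallel branches, plus made-up costs and speeds.
graph = {
    "conv1": ["branch_a", "branch_b"],
    "branch_a": ["concat"],
    "branch_b": ["concat"],
    "concat": [],
}
cost = {"conv1": 4.0, "branch_a": 2.0, "branch_b": 2.0, "concat": 1.0}
speeds = {"edge0": 2.0, "edge1": 1.0}
placement, load = assign_layers(graph, cost, speeds)
```

In this toy case the two parallel branches end up on different devices and both devices carry the same total load.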

Efficient Edge Service Migration in Mobile Edge Computing

Zeng Zeng, Shihao Li, Weiwei Miao, Lei Wei, Chengling Jiang, Chuanjun Wang and Mingxuan Zhang

Edge computing is one of the emerging technologies aiming to enable timely computation at the network edge. With virtualization technologies, the role of the traditional edge providers is separated into two: edge infrastructure providers (EIPs), who manage the physical edge infrastructure, and edge service providers (ESPs), who purchase slices of physical resources (e.g., CPU, bandwidth, memory space, disk storage) from EIPs and then cache service entities to offer their own value-added services to end users. These value-added services are also called virtual network functions (VNFs). Edge computing environments are dynamic, and the computing resource requirements of edge services usually fluctuate over time. Thus, when the demand of a VNF cannot be satisfied, we need to design strategies for migrating the VNF so as to meet its demand and retain the network performance. In this paper, we concentrate on migrating VNFs efficiently (MV), such that the migration can meet the bandwidth requirement for data transmission. We prove that MV is NP-complete. We present several exact and heuristic solutions to tackle it. Extensive simulations demonstrate that the proposed heuristics are efficient and effective.

A protocol-independent container network observability analysis system based on eBPF

Chang Liu, Zhengong Cai, Bingshen Wang, Zhimin Tang, and Jiaxu Liu

Technologies such as microservices, containerization and Kubernetes in cloud-native environments make large-scale application delivery easier and easier, but troubleshooting and fault location in the face of massive applications are becoming more and more complex. Currently, the data collected by mainstream sampling-based monitoring technologies can hardly cover all anomalies, and the kernel's lack of observability also makes it difficult to monitor more detailed data in container environments such as the Kubernetes platform. In addition, most current technology solutions use tracing and application performance monitoring (APM) tools, but these technologies limit the language used by the application and need to be invasive into the application code. Many scenarios require more general network performance detection and diagnosis methods that do not invade the user application. In this paper, we propose to introduce network monitoring at the kernel level, below the application, for the Kubernetes cluster in Alibaba container service. Through non-intrusive, eBPF-based collection of the L7/L4 network protocol interaction information of user applications, data collection of more than 10M throughputs per second can be achieved without modifying any kernel or application code, while the impact on the system application is less than 1%. The system also uses machine learning methods to analyze and diagnose application network performance and problems, analyze network performance bottlenecks and locate specific instance information for different applications, and realize protocol-independent location and analysis of network performance problems.

Session Chair

Yanchao Zhao (Nanjing University of Aeronautics and Astronautics) and Sheng Zhang (Nanjing University)

Session H3

Workshop — Heterogeneous Multi-access Mobile Edge Computing and Applications

Conference
3:00 PM — 4:40 PM HKT
Local
Dec 3 Thu, 2:00 AM — 3:40 AM EST

Performance Guaranteed Single Link Failure Recovery in SDN Overlay Networks

Lilei Zheng, Hongli Xu, Suo Chen and Liusheng Huang

An SDN overlay network is a legacy network improved through SDN and overlay technology. Its traits include a low upgrade cost, flexible network management and the sharing of physical network resources, which have brought huge benefits to multi-tenant cloud platforms. Link failure is an important issue that should be solved in any large network. In SDN overlay networks, link failure recovery brings new challenges that differ from those in the legacy network, such as how to maintain the performance of overlay networks in the post-recovery network. Thus, in the case of single link failure, we devise a recovery approach to guarantee the performance of overlay networks through coordination between SDN switches and traditional switches. We formulate the link failure recovery (LFR) problem as an integer linear program and prove its NP-hardness. A rounding-based algorithm with bounded approximation factors is devised to solve the LFR problem. The simulation results show that the devised scheme can guarantee the performance of the overlay network after restoration. The results also show that, compared with SPR and IPFRR, the designed method can reduce the maximum link load rate by approximately 41.5% and 51.6%, respectively.

A Personal Distributed Real-time Collaborative System

Michalis Konstantopoulos, Nikos Chondros and Mema Roussopoulos

In this paper, we present O3REAL, a privacy-preserving distributed middleware for real-time collaborative editing of documents. O3REAL introduces a novel approach for building peer-to-peer real-time collaborative applications, using a reliable broadcast channel mechanism for network communication, but at the same time provides for persistent storage management of collaborative documents using the filesystem interface of a POSIX compliant filesystem. This approach enables real-time, completely decentralized collaboration among users, without the need for a third party to intervene, and significantly simplifies the creation of peer-to-peer collaborative applications. We demonstrate that O3REAL scales well for real-time collaboration use-cases. For example, with 33 users simultaneously collaborating on a document in real time over a WAN with a 50 ms link delay, the average perceived latency is approximately 54 ms, which is very close to the optimal baseline. In comparison, Etherpad exhibits nearly twice the perceived latency.

Cooperative Resource Sharing Strategy With eMBB Cellular and C-V2X Slices

Yan Liang, Xin Chen, Shuang Chen and Ying Chen

The emerging fifth generation (5G) wireless technologies support services with highly heterogeneous requirements. Network slicing technology can compose multiple logical networks and allocate wireless resources according to the needs of each user, which can reduce the cost of hardware and network resources. Nevertheless, how systems containing different types of users can reduce the cost of resources remains a challenging question. In this paper, we study the system cost of two types of user groups requesting resource blocks (RBs) at the radio access network (RAN): the enhanced mobile broadband (eMBB) cellular user group and the cellular vehicle to everything (C-V2X) user group. To improve rational resource utilization, we apply dynamic resource pricing according to the needs of users. Then, we propose a Cooperative Resource Sharing (CRS) Algorithm, which makes the two user groups jointly purchase and share resources. The simulation results show that the strategy used in this algorithm can effectively reduce the unit price of RBs and minimize the total cost of the system.

CP-BZD Repair Codes Design for Distributed Edge Computing

Shuangshuang Lu, Chanting Zhang and Mingjun Dai

In edge computing applications, data is distributed across several nodes. A failed node means losing part of the data, which may hamper edge computing. Node repair is needed because node failures are frequent in edge computing systems. Codes with both the combination property (CP) and the binary zigzag decodable (BZD) property are referred to as CP-BZD codes. In this paper, without adding extra checking bits, new coding constructions of CP-BZD codes are proposed to repair the failed node in distributed storage systems. All constructed codes can be decoded by the zigzag-decoding algorithm. Numerical analysis shows that, compared with the original CP-BZD codes, our proposed schemes obtain better repair efficiency.
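
Zigzag decoding itself can be shown on a toy example with two equal-length bit packets. One coded packet is the plain XOR of the two sources, the other XORs the first source with the second shifted by one position; the decoder then alternates between the two coded packets, recovering one bit at a time. This is a generic two-packet illustration with hypothetical names, not one of the CP-BZD constructions or repair schemes proposed in the paper.

```python
def encode(a, b):
    """Return c1 = a XOR b and c2 = a XOR (b delayed by one bit position)."""
    length = len(a)
    c1 = [x ^ y for x, y in zip(a, b)]
    c2 = list(a) + [0]            # one extra position for the shifted copy of b
    for i in range(length):
        c2[i + 1] ^= b[i]
    return c1, c2


def zigzag_decode(c1, c2):
    """Recover a and b bit by bit, alternating between c2 and c1."""
    a, b = [], []
    for i in range(len(c1)):
        # c2[i] mixes a[i] with the already known b[i - 1] (nothing for i == 0).
        a_bit = c2[i] ^ (b[i - 1] if i > 0 else 0)
        a.append(a_bit)
        # c1[i] mixes a[i] with b[i]; a[i] is now known, so b[i] follows.
        b.append(c1[i] ^ a_bit)
    return a, b


source_a = [1, 0, 1, 1]
source_b = [0, 1, 1, 0]
coded_1, coded_2 = encode(source_a, source_b)
decoded_a, decoded_b = zigzag_decode(coded_1, coded_2)
```

The shift exposes the first bit of one source in the clear, and every later bit is obtained with a single XOR, which is why decoding uses only binary operations at the cost of a slightly longer coded packet.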

Computation Task Scheduling and Offloading Optimization for Collaborative Mobile Edge Computing

Bin Lin, Xiaohui Lin, Shengli Zhang, Hui Wang and Suzhi Bi

A mobile edge computing (MEC) platform allows its subscribers to utilize computational resources in close proximity to reduce the computation latency. In this paper, we consider two users, each of which has a set of computation tasks to execute. In particular, one user is a registered subscriber that can access the computation service of the MEC platform, while the other, unregistered user cannot directly access the MEC service. In this case, we allow the registered user to receive computation offloading from the unregistered user, compute the received task(s) locally or further offload them to the MEC platform, and charge a fee that is proportional to the computation workload. We study the problem from the registered user's perspective, maximizing its total utility, which balances the monetary income against the cost of execution delay and energy consumption. We formulate a mixed integer non-linear programming (MINLP) problem that jointly decides the execution scheduling of the computation tasks (i.e., the device where each task is executed) and the computation/communication resource allocation. To tackle the problem, we first derive the closed-form solution of the optimal resource allocation given the integer task scheduling decisions. We then propose a reduced-complexity approximate algorithm to optimize the combinatorial computation scheduling decisions. Simulation results show that the proposed collaborative computation scheme effectively improves the utility of the helper user compared with other benchmark methods, and the proposed solution method approaches the optimal solution within a 0.1% average performance gap with significantly reduced complexity.

Session Chair

Yuan Wu (University of Macau)

Session Closing

Conference Closing

Conference
5:20 PM — 5:30 PM HKT
Local
Dec 3 Thu, 4:20 AM — 4:30 AM EST

Session Chair

To Be Determined
