Collaborative Intelligence: Challenges and Opportunities
Abstract
This paper presents an overview of the emerging area of collaborative intelligence (CI). Our goal is to raise awareness in the signal processing community of the challenges and opportunities in this area of growing importance, where key developments are expected to come from signal processing and related disciplines. The paper surveys the current state of the art in CI, with special emphasis on signal processing-related challenges in feature compression, error resilience, privacy, and system-level design.
Index Terms— Collaborative intelligence, feature compression, feature error resilience, distributed inference
1 Introduction
Artificial Intelligence (AI) is moving from research labs to the real world. One of the promising avenues for bringing AI “to the edge” is Collaborative Intelligence (CI), a framework in which AI inference is shared between the edge devices and the cloud. In CI, typically, the front-end of an AI model is deployed on an edge device, where it performs initial processing and feature computation. These intermediate features are then sent to the cloud, where the back-end of the AI model completes the inference, as shown in Fig. 1. Variations on this basic model are possible, where multiple edge devices or multiple inference engines are involved.
CI has been shown to have the potential for energy and latency savings compared to the more typical cloud-based or fully edge-based AI model deployment [1, 2], but it also introduces new challenges, which require new science and engineering principles to be developed in order to achieve optimal designs. In CI, a capacity-limited channel is inserted in the information pathway of an AI model. This necessitates compression of features computed at the edge sub-model. Moreover, errors introduced into features due to channel imperfections would need to be handled at the cloud side in order to perform successful inference. Finally, issues related to the privacy of transmitted data need to be addressed.
This paper presents an overview of signal processing-related challenges and opportunities brought by CI. Section 2 presents an overview of the existing work and future challenges in deep intermediate feature compression. Section 3 discusses error resilience of intermediate features to bit errors and packet loss, while Section 4 focuses on privacy aspects. Section 5 discusses challenges in system-level design of CI systems. Finally, Section 6 concludes the paper.

2 Intermediate feature compression
Due to the capacity-limited channel in the information path of a CI model, intermediate features transferred from the edge to the cloud need to be compressed. Feature compression in itself is not a new topic; earlier work on Compact Descriptors for Visual Search (CDVS) [3] and Compact Descriptors for Video Analysis (CDVA) [4] considered compression of (mostly handcrafted) features for visual analytics, especially matching, search, and retrieval. Such features are "ultimate" features, in the sense that they can be used directly in the cloud, without the significant back-end processing shown in Fig. 1. In CI, on the other hand, the focus is on learned intermediate features, and the applications range beyond visual analytics.
A new scheme of coding and transmitting intermediate deep learned features, instead of sensor data or the aforementioned ultimate features, has recently emerged [5, 6, 7]. Here, a deep neural network is split into an edge part and a cloud part, and the task is performed in an edge-cloud collaborative manner. The generic backbone network, which carries most of the computing burden in visual analysis, can be located on the edge to extract intermediate features, while the task-specific components are assigned to the cloud. Load balancing can thus be achieved in a flexible, task-specific manner, and a generic deep learning architecture can be deployed to reduce the cost of edge devices. To enable this new paradigm, compression and coding techniques for intermediate deep learning features are needed.
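As a toy illustration of such a split (not tied to any specific work cited above), an off-the-shelf classification network can be partitioned into an edge front-end and a cloud back-end; the split point below is arbitrary and chosen only for the example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Illustrative edge/cloud split of a standard ResNet-50. The split point
# (after "layer2") is arbitrary; in practice it would be chosen to balance
# edge compute, intermediate feature size, and cloud load.
model = resnet50()
layers = list(model.children())                  # conv1 ... layer4, avgpool, fc
edge_model = nn.Sequential(*layers[:6])          # runs on the edge device
cloud_model = nn.Sequential(*layers[6:-1],       # runs in the cloud
                            nn.Flatten(), layers[-1])

x = torch.randn(1, 3, 224, 224)                  # input stays on the device
features = edge_model(x)                         # intermediate features to compress and transmit
logits = cloud_model(features)                   # inference completed in the cloud
```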
Lossless compression is not practical for intermediate deep feature coding [8], so most existing works have employed conventional codecs for lossy feature compression. The work in [6, 7] performed lossy intermediate deep feature compression for the object detection task, using traditional image/video codecs including PNG, JPEG, JPEG2000, VP9, and HEVC. A coding framework with three modules (PreQuantization, Repack, and VideoEncoder) was devised in [5, 8] to encode intermediate deep features, and its compression performance was demonstrated on three computer vision tasks (image classification, image retrieval, and visual object detection). In addition, multi-task learning has been investigated in this context [9].
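As a rough sketch of what pre-quantization and repacking might look like (the uniform 8-bit quantizer and the channel tiling below are illustrative assumptions, not the exact design of [5, 8]), a float feature tensor can be mapped to a single 8-bit frame that a standard image or video codec can then compress:

```python
import numpy as np

def prequantize(feature, n_bits=8):
    """Uniformly quantize a float feature tensor (C, H, W) to n-bit integers.
    Returns the quantized tensor plus the (min, max) needed for dequantization."""
    fmin, fmax = float(feature.min()), float(feature.max())
    scale = (2 ** n_bits - 1) / (fmax - fmin + 1e-12)
    q = np.round((feature - fmin) * scale).astype(np.uint8)
    return q, (fmin, fmax)

def repack(q):
    """Tile the C channel maps of a (C, H, W) tensor into one 2-D frame
    so that a standard image/video codec can compress it."""
    c, h, w = q.shape
    cols = int(np.ceil(np.sqrt(c)))
    rows = int(np.ceil(c / cols))
    frame = np.zeros((rows * h, cols * w), dtype=q.dtype)
    for i in range(c):
        r, col = divmod(i, cols)
        frame[r * h:(r + 1) * h, col * w:(col + 1) * w] = q[i]
    return frame

# Example: a hypothetical 256-channel, 26x26 intermediate feature tensor.
feat = np.random.randn(256, 26, 26).astype(np.float32)
q, (fmin, fmax) = prequantize(feat)
frame = repack(q)          # now ready for a PNG/HEVC encoder
print(frame.shape)         # (416, 416)
```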
The optimization goal of video coding is to minimize signal loss under a given bitrate constraint. In contrast, the optimization goal of feature coding is to minimize the loss of analysis and recognition performance at a given bitrate. To date, deep video feature compression that maintains analysis or retrieval performance remains an open problem. A rate-performance-loss optimization model [10] trades off the required bitrate against analysis/retrieval performance; it is similar to Rate-Distortion Optimization (RDO) in video coding, but with the performance loss as the distortion term (see the Lagrangian form sketched after this paragraph). To exploit temporal redundancy in deep features for improved compression performance, three types of features are defined according to their potential coding modes: independently-coded features (I-features), predictively-coded features (P-features), and skip-coded features (S-features). Following this, a joint coding framework for local and global deep features extracted from videos was proposed, in which local feature coding can be accelerated by making use of the frame types determined with global features, while the decoded local features can be used as a reference for global feature coding [11]. The authors in [12] introduced feature and texture compression in a scalable coding framework, where the base layer caters to the deep learning feature while the enhancement layer aims to reconstruct the texture. Towards collaborative compression and intelligent analytics, Video Coding for Machines (VCM) [13, 14] bridges the gap between feature coding for machine vision and video coding for human vision.
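To make the analogy with RDO explicit (a schematic formulation; the exact model in [10] may differ in its details), let R denote the feature bitrate and D_task the loss in analysis/retrieval performance. The rate-performance-loss tradeoff can then be cast in the familiar Lagrangian form

\min_{\theta} \; J(\theta) = D_{\mathrm{task}}(\theta) + \lambda \, R(\theta),

where \theta denotes the coding parameters (quantization step, coding mode, etc.) and \lambda selects the operating point on the rate-performance curve, just as it selects the operating point on the rate-distortion curve in conventional RDO.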
Compressing features (instead of the whole visual signal) yields more accurate visual features, since they are extracted from the original (rather than the decoded) visual signal. As noted earlier, there is flexibility in balancing the computational load between edge devices and cloud servers. There are also fewer privacy concerns, because the whole video is not coded and transmitted, as discussed further in Section 4. Initial standardization efforts for intermediate feature compression are under way: MPEG has started an ad-hoc working group on VCM, and AVS (Audio Video Coding Standard of China) earlier established a visual feature coding working group for intermediate deep learned feature compression [15].
The coding frameworks described above adopt traditional video codecs, which do not closely match the feature data distribution, since they were designed and optimized for natural images and videos; moreover, some coding tools are inevitably redundant in feature compression, consuming unnecessary computing resources. As a next step, more coding tools designed specifically for intermediate deep features are anticipated.
Traditional coding methods aim for the best video quality under a given bitrate constraint for human consumption, while the new framework allows useful information to be extracted directly for machine analysis. Hence, joint optimization of video coding and feature coding can be investigated in the future. One could explore how to use both the feature bitstream and the video bitstream to reconstruct high-quality videos for human and/or machine consumption.
As for the deep learning models themselves, coding-aware backbone networks and task-specific model design are another meaningful research direction. In particular, a coding-aware backbone network at the edge side should generate features that are robust against lossy compression, while the coding-aware task-specific model at the cloud side should maintain high performance with possibly corrupted features. For instance, coding-aware feature generation has been achieved in [16].
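One common way to obtain such coding-robust features, shown here only as a hedged sketch rather than as the method of [16], is to simulate quantization inside the training loop with a straight-through gradient estimator, so that both the edge and cloud sub-models learn to tolerate the compression noise:

```python
import torch
import torch.nn as nn

class STEQuantize(torch.autograd.Function):
    """Simulated n-bit uniform quantization with a straight-through gradient,
    so the edge sub-model can be trained with compression in the loop."""
    @staticmethod
    def forward(ctx, x, n_bits=8):
        levels = 2 ** n_bits - 1
        xmin, xmax = x.min(), x.max()
        scale = levels / (xmax - xmin + 1e-12)
        return torch.round((x - xmin) * scale) / scale + xmin

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # gradients pass straight through the quantizer

class CodingAwareCI(nn.Module):
    """Toy edge/cloud split in which features are quantized between the two parts."""
    def __init__(self, edge: nn.Module, cloud: nn.Module):
        super().__init__()
        self.edge, self.cloud = edge, cloud

    def forward(self, x):
        feat = self.edge(x)
        feat = STEQuantize.apply(feat, 8)   # simulate lossy feature coding
        return self.cloud(feat)
```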
3 Error and loss resilience
Besides being capacity-limited, the communication channel will introduce errors into the information path of a CI model. At the physical layer, bit errors will be observed in the transmitted data. At the network/transport layer, packet loss will be introduced into the transmitted data. Error control methods will need to be designed to handle these impairments.
From the existing work on feature compression, it appears that intermediate features can tolerate some degree of quantization noise without much degradation of the model's accuracy. However, the resilience of intermediate features to bit errors and packet loss, and the corresponding error control methods, is currently largely unexplored. Among the few existing works, [17, 18] studied joint source-channel coding of intermediate features, [19] explored simple interpolation methods for recovering features missing due to packet loss, and [20] proposed low-rank tensor completion for this purpose. All these methods have considered single-image input to the CI system rather than video input. Many open issues remain, revolving around finding the best ways to exploit various redundancies (spatial, temporal, inter-channel) in the feature domain in order to develop effective data recovery methods.
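As a minimal illustration of feature recovery, and only a crude stand-in for the interpolation of [19] or the tensor completion of [20], missing rows of a feature tensor (e.g., rows carried in a lost packet) can be filled by linear interpolation from the nearest received rows:

```python
import numpy as np

def recover_missing_rows(feat, lost_rows):
    """Fill feature-tensor rows lost to packet loss by linear interpolation
    between the nearest received rows in the same channel."""
    feat = feat.copy()
    c, h, w = feat.shape
    kept = np.setdiff1d(np.arange(h), lost_rows)   # rows that were received
    for ch in range(c):
        for col in range(w):
            feat[ch, lost_rows, col] = np.interp(lost_rows, kept, feat[ch, kept, col])
    return feat

# Hypothetical example: rows 5-8 of every channel were in a lost packet.
feat = np.random.randn(64, 28, 28).astype(np.float32)
recovered = recover_missing_rows(feat, lost_rows=np.array([5, 6, 7, 8]))
```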
One could also imagine developing unequal error protection schemes, where the level of protection is tailored to the importance of specific features for the model's accuracy. Feature importance could be gauged using attention mechanisms [21], while the actual error protection could be implemented via channel codes of varying strength, unequal power allocation, or by matching features of different importance with appropriate network traffic management policies and Quality of Service (QoS) schemes.
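A simple proxy for such an importance ranking, offered only as an illustration (attention-derived scores as in [21] could be substituted), is the per-channel magnitude of the task-loss gradient with respect to the features; channels with larger scores would then receive stronger channel coding or higher QoS priority:

```python
import torch

def channel_importance(cloud_model, features, labels, loss_fn):
    """Rank feature channels (NCHW layout) by the mean absolute task-loss
    gradient, as a crude proxy for their importance to model accuracy."""
    features = features.detach().requires_grad_(True)
    loss = loss_fn(cloud_model(features), labels)
    loss.backward()
    scores = features.grad.abs().mean(dim=(0, 2, 3))   # average over batch and space
    return torch.argsort(scores, descending=True)      # most important channels first
```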
4 Privacy
From the privacy point of view, an inherent advantage of CI systems over more traditional cloud-based analytics is that the original input signal never leaves the edge device; only intermediate features are transmitted for further processing. These intermediate features may or may not resemble the input images, but even if they look very different from the input, this does not mean that it is impossible to reconstruct private information from them. Privacy is an extensively studied topic in machine/deep learning [22]. However, much of the work in this area is focused on learning, whereas the CI approach requires inference-time privacy. Based on the classification presented in [22], CI privacy is most closely related to model inversion and attribute inference attacks.

Fig. 2, adapted from [7], shows input images reconstructed from layers 7, 11, and 17 of YOLOv2 [23]. Input reconstruction was accomplished using a trained "mirror model," which has the same architecture as the front-end model up to the given layer, but with pooling layers replaced by upsampling and convolutions replaced by transpose convolutions. Although this is not necessarily the best architecture for input reconstruction, it shows that fairly decent reconstruction, potentially revealing private information, is possible from the layer-7 features of YOLOv2, and in some cases from layer 11 as well. Further, full input reconstruction may not even be necessary for privacy to be breached; for example, someone's identity may be revealed by the shape of their head or the silhouette of their haircut, without the fine details of their facial features.
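A mirror model of this kind could look roughly as follows; the channel counts and spatial sizes are hypothetical and do not correspond to the exact YOLOv2 front-end:

```python
import torch.nn as nn

# Illustrative "mirror" decoder: roughly reverses a front-end that has reduced a
# 3x416x416 input to a 256x52x52 feature map (numbers are hypothetical). Pooling
# is replaced by upsampling and convolutions by transpose convolutions.
mirror = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.ConvTranspose2d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.ConvTranspose2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.ConvTranspose2d(64, 3, kernel_size=3, padding=1), nn.Sigmoid(),
)
# Training (not shown) would minimize e.g. the MSE between mirror(features)
# and the original input images.
```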
Hence, the privacy problem in CI systems is much more involved than simply avoiding transmission of the input signal. However, since the intermediate features are learned, this creates opportunities to construct a learning process that results in features that are privacy-friendly in addition to supporting the main analytics task(s). One way to accomplish this would be to develop loss functions that increase the uncertainty (entropy) about private information while reducing (or trading off) the error(s) on the main task(s). Another approach would be to add noise to the intermediate features to improve their privacy [24]. One could also imagine an adversarial learning strategy [25], where the "discriminator" tries to infer private information from the intermediate features, while the "generator" tries to create features that protect it.
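A hedged sketch of the adversarial variant, with hypothetical edge, cloud, and attacker modules and a simple weighted loss, could look as follows:

```python
import torch

def adversarial_privacy_step(edge, cloud, attacker, x, y_task, y_private,
                             opt_model, opt_attacker, alpha=0.1):
    """One toy training step: the attacker tries to predict a private attribute
    from the intermediate features, while the edge/cloud model is trained to be
    accurate on the main task AND to maximize the attacker's loss."""
    ce = torch.nn.functional.cross_entropy

    # 1) Update the attacker on detached features (edge/cloud stay fixed).
    feat = edge(x)
    attacker_loss = ce(attacker(feat.detach()), y_private)
    opt_attacker.zero_grad()
    attacker_loss.backward()
    opt_attacker.step()

    # 2) Update edge+cloud: good on the task, bad for the attacker.
    feat = edge(x)
    model_loss = ce(cloud(feat), y_task) - alpha * ce(attacker(feat), y_private)
    opt_model.zero_grad()
    model_loss.backward()
    opt_model.step()
```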
5 System-level design
Compared with the traditional cloud-centered data processing mode, CI systems face many challenges in big data processing, such as latency constraints and memory footprint. Big data processing for CI relies on vast computing resources. The traditional cloud-centered processing architecture places an enormous burden on the cloud, especially given the high computational complexity of deep neural networks (DNNs). Meanwhile, AI-driven devices are increasingly expected to perform various related tasks by running multiple homogeneous DNNs simultaneously. Deploying these networks on resource-constrained devices is a major challenge.
Typically, a DNN can be divided into several modules. In this way, part of the DNN computation can be migrated to front-end devices equipped with graphics processing units (GPUs), which alleviates the cloud burden and shortens the processing latency. A DNN partition method was introduced by Kang et al. [1] to improve the efficiency and throughput of the whole system; a scheduler automatically partitions the DNN computation between mobile devices and data centers. Eshratifar et al. [16] presented an intermediate feature compression method for dynamic DNN partitioning. An inference computation allocation method for multi-layer Internet of Things systems was introduced in [26], in which all videos still need to be sent to the cloud servers. Zhang et al. [27] provided a task-level allocation method to shorten the end-to-end latency.
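The core idea behind such partitioning can be illustrated with a toy split-point search (a simplification for exposition, not the actual scheduler of [1]): given profiled per-layer execution times and transfer sizes, choose the split that minimizes end-to-end latency.

```python
def best_split(edge_time, cloud_time, transfer_bytes, bandwidth_Bps):
    """Toy split-point search for a sequential DNN with n layers.
    edge_time[i], cloud_time[i] : profiled run time of layer i on the edge / cloud
    transfer_bytes[k]           : bytes to send if the network is split after layer k
                                  (transfer_bytes[0] = raw input size)
    Returns (latency, k) for the split k in 0..n minimizing
    edge compute + feature upload + cloud compute latency."""
    n = len(edge_time)
    candidates = []
    for k in range(n + 1):
        latency = (sum(edge_time[:k])
                   + transfer_bytes[k] / bandwidth_Bps
                   + sum(cloud_time[k:]))
        candidates.append((latency, k))
    return min(candidates)
```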
For multi-task memory storage reduction, sharing information across different networks provides a promising way to reduce the sizes of multiple correlated models. How to optimize the storage usage of multi-task homogeneous networks without significant performance degradation on each task is therefore a tricky problem; parts of different networks need to be shared to remove inter-model redundancy. Most studies focus on compressing a single deep network [28, 29, 30, 31], although removing the intra-redundancy within each individual model is important for deployment on resource-constrained devices. An intuitive solution to reduce the inter-redundancy is multi-task learning [32, 33]. Multi-task learning aims to improve generalization by sharing representations between related tasks. A commonly used multi-task learning approach is to share the hidden layers between all tasks while preserving several task-specific output layers, or even to use a single shared architecture [34, 35, 36] (a minimal sketch of this hard-sharing pattern appears after this paragraph). Recently, more works have developed more effective mechanisms for multi-task learning by learning additional task-specific parameters for new tasks. Cross-stitch units were introduced in [32] to allow features to be shared across tasks. In [37], a residual adapter module was proposed to enable a high degree of parameter sharing between domains. The authors in [38] learned new filters that are linear combinations of existing filters for new tasks. A Multi-Task Attention Network consisting of two major components was proposed in [39]: a single shared network learning a global feature pool, and a soft-attention module for each task, allowing automatic learning of both task-shared and task-specific features.
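The hard parameter sharing pattern mentioned above can be sketched as follows; the backbone and layer sizes are illustrative only:

```python
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    """Hard parameter sharing: one backbone shared by all tasks, with a small
    task-specific head per task."""
    def __init__(self, num_classes_per_task):
        super().__init__()
        self.backbone = nn.Sequential(            # shared by every task
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleList(               # one lightweight head per task
            [nn.Linear(64, n) for n in num_classes_per_task]
        )

    def forward(self, x):
        shared = self.backbone(x)
        return [head(shared) for head in self.heads]
```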
Most multi-task learning methods adopt pre-defined shared structures [40, 41, 34], which may result in performance degradation due to improper sharing of knowledge. Some approaches learn what to share across tasks from pre-trained models [42, 43, 44]. However, such methods suffer from some limitations: 1) the dependence on pre-trained models makes them less versatile and less adaptable to different model initializations; 2) additional training overhead is required to obtain pre-trained models when they are not readily available.
Meanwhile, some other popular methods [42, 43, 44] have explored sharing parameters among pre-trained models. Multiple well-trained DNNs are merged in [42] and [43] directly via weight sharing. A branched multi-task structure was proposed in [44], leveraging task affinities for layer sharing among pre-trained networks. However, these methods are over-reliant on pre-trained models, resulting in a lack of adaptability to different initializations and in additional training overhead.
6 Conclusion
In this paper we presented a brief overview of signal processing-related challenges and opportunities in collaborative intelligence. These were grouped into the areas where we feel the signal processing community will make key contributions: feature compression, error resilience, privacy, and system-level design. Many other important areas, such as lightweight/reduced-precision deep models, embedded DNN system development, and feature communication and scheduling, could not be included due to space constraints, but in those areas too, signal processing is expected to provide key insights and solutions. We look forward to seeing the field grow, and to signal processing playing an important role in it.
7 Acknowledgement
Prof. Tian’s work is partially supported by grants from the National Natural Science Foundation of China under contract No. 61825101.
References
- [1] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proc. 22nd ACM Int. Conf. Arch. Support Programming Languages and Operating Syst., 2017, pp. 615–629.
- [2] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Computing, 2019, Early Access.
- [3] L. Duan, V. Chandrasekhar, J. Chen, J. Lin, Z. Wang, T. Huang, B. Girod, and W. Gao, “Overview of the MPEG-CDVS standard,” IEEE Trans. Image Processing, vol. 25, no. 1, pp. 179–194, 2016.
- [4] L. Duan, Y. Lou, Y. Bai, T. Huang, W. Gao, V. Chandrasekhar, J. Lin, S. Wang, and A. C. Kot, “Compact descriptors for video analysis: The emerging MPEG standard,” IEEE MultiMedia, vol. 26, no. 2, pp. 44–54, 2019.
- [5] Z. Chen, K. Fan, S. Wang, L. Duan, W. Lin, and A. C. Kot, “Toward intelligent sensing: Intermediate deep feature compression,” IEEE Trans. Image Processing, vol. 29, pp. 2230–2243, 2019.
- [6] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP’18, 2018.
- [7] H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6, 2018.
- [8] Z. Chen, K. Fan, S. Wang, L. Duan, W. Lin, and A. C. Kot, “Lossy intermediate deep learning feature compression and evaluation,” in Proc. 27th ACM International Conference on Multimedia, 2019, pp. 2414–2422.
- [9] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP’19, 2019, pp. 1705–1709.
- [10] L. Ding, Y. Tian, H. Fan, Y. Wang, and T. Huang, “Rate-performance-loss optimization for inter-frame deep feature coding from videos,” IEEE Trans. Image Processing, vol. 26, pp. 5743–5757, 2017.
- [11] L. Ding, Y. Tian, H. Fan, C. Chen, and T. Huang, “Joint coding of local and global deep features in videos for visual search,” IEEE Trans. Image Processing, vol. 29, pp. 3734–3749, 2020.
- [12] S. Wang, S. Wang, W. Yang, X. Zhang, S. Wang, S. Ma, and W. Gao, “Towards analysis-friendly face representation with scalable feature and texture compression,” arXiv:2004.10043, Apr. 2020.
- [13] MPEG AHG VCM, “Video coding for machines,” ITU-T Q12/16, https://www.itu.int/en/ITU-T/Workshops-and-Seminars/20191008/Documents/Yuan_Zhang_Presentation.pdf, 2019.
- [14] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Transactions on Image Processing, vol. 29, pp. 8680–8695, 2020.
- [15] S. Ma, “Investigation report on AI-related coding techniques,” Audio Video Coding Standard (AVS) document AVS N2510, 63rd AVS Meeting, 2017.
- [16] A. E. Eshratifar, A. Esmaili, and M. Pedram, “Towards collaborative intelligence friendly architectures for deep learning,” in 20th International Symposium on Quality Electronic Design (ISQED), 2019.
- [17] K. Choi, K. Tatwawadi, A. Grover, T. Weissman, and S. Ermon, “Neural joint source-channel coding,” in Proc. ICML, Jun. 2019, pp. 1182–1192.
- [18] J. Shao and J. Zhang, “Bottlenet++: An end-to-end approach for feature compression in device-edge co-inference systems,” in Proc. IEEE ICC Workshops, 2020, pp. 1–6.
- [19] H. Unnibhavi, H. Choi, S. R. Alvar, and I. V. Bajić, “DFTS: Deep feature transmission simulator,” in IEEE MMSP, 2018, demo.
- [20] L. Bragilevsky and I. V. Bajić, “Tensor completion methods for collaborative intelligence,” IEEE Access, vol. 8, pp. 41162–41174, 2020.
- [21] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, “Residual non-local attention networks for image restoration,” in Proc. ICLR, 2018.
- [22] F. Mireshghallah, M. Taram, P. Vepakomma, A. Singh, R. Raskar, and H. Esmaeilzadeh, “Privacy in deep learning: A survey,” arXiv:2004.12254, Jul. 2020.
- [23] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” in Proc. IEEE CVPR’17, Jul. 2017, pp. 6517–6525.
- [24] F. Mireshghallah, M. Taram, P. Ramrakhyani, A. Jalali, D. Tullsen, and H. Esmaeilzadeh, “Shredder: Learning noise distributions to protect inference privacy,” in Proc. 25th ACM Int. Conf. Arch. Support Programming Languages and Operating Syst., Mar. 2020, pp. 3–18.
- [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
- [26] J. Zhou, Y. Wang, K. Ota, and M. Dong, “AAIoT: Accelerating artificial intelligence in IoT systems,” IEEE Wireless Communications Letters, vol. 8, pp. 825–828, 2019.
- [27] D. Zhang, Y. Ma, C. Zheng, Y. Zhang, X. S. Hu, and D. Wang, “Cooperative-competitive task allocation in edge computing for delay-sensitive social sensing,” 2018 IEEE/ACM Symposium on Edge Computing (SEC), pp. 243–259, 2018.
- [28] A. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
- [29] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in Proc. IEEE CVPR, 2019, pp. 4340–4349.
- [30] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid, “Towards effective low-bitwidth convolutional neural networks,” in Proc. IEEE CVPR, 2018, pp. 7920–7928.
- [31] D. Zhang, J. Yang, D. Ye, and G. Hua, “Lq-nets: Learned quantization for highly accurate and compact deep neural networks,” in Proc. European Conference on Computer Vision (ECCV), 2018, pp. 365–382.
- [32] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch networks for multi-task learning,” in Proc. IEEE CVPR, 2016, pp. 3994–4003.
- [33] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proc. IEEE CVPR, 2018, pp. 7482–7491.
- [34] P. Georgiev, S. Bhattacharya, N. D. Lane, and C. Mascolo, “Low-resource multi-task audio sensing for mobile and embedded devices via shared deep neural network representations,” Proc. ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, pp. 1–19, 2017.
- [35] K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher, “A joint many-task model: Growing a neural network for multiple NLP tasks,” in Proc. 2017 Conf. Empirical Methods in Natural Language Processing, 2017, pp. 1923–1933.
- [36] L. Kaiser, A. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv:1706.05137, Jun. 2017.
- [37] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,” in Proc. NIPS, Dec. 2017, pp. 506–516.
- [38] A. Rosenfeld and J. Tsotsos, “Incremental learning through deep adaptation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 42, pp. 651–663, Mar. 2020.
- [39] S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with attention,” in Proc. IEEE CVPR, Jun. 2019, pp. 1871–1880.
- [40] R. Caruana, “Multitask learning: A knowledge-based source of inductive bias,” in Proc. ICML, Jul. 1993, pp. 41–48.
- [41] M. Long and J. Wang, “Learning multiple tasks with deep relationship networks,” arXiv:1506.02117, 2015.
- [42] X. He, Z. Zhou, and L. Thiele, “Multi-task zipping via layer-wise neuron sharing,” in Proc. NeurIPS, 2018, pp. 6016–6026.
- [43] Y.-M. Chou, Y.-M. Chan, J.-H. Lee, C.-Y. Chiu, and C.-S. Chen, “Unifying and merging well-trained deep neural networks for inference stage,” arXiv:1805.04980, May 2018.
- [44] S. Vandenhende, B. Brabandere, S. Georgoulis, and L. Van Gool, “Branched multi-task networks: Deciding what layers to share,” arXiv:1904.02920, Apr. 2019.