In the autonomous driving industry, there is a growing trend to employ long-wave infrared (LWIR)-based uncooled thermal-imaging cameras, which can robustly collect data even in extreme environments. Consequently, both industry and academia are actively researching contrast-enhancement techniques to improve the image quality of LWIR-based thermal-imaging cameras. However, most published results only showcase experiments on mass-produced products that already incorporate contrast-enhancement techniques. Put differently, there is a lack of experimental data on contrast enhancement after the non-uniformity correction (NUC) and temperature compensation (TC) processes that generate the images seen in final products. To bridge this gap, we propose a histogram equalization (HE)-based contrast-enhancement method that incorporates a region-based clipping technique, and we present experimental results on images obtained after applying the NUC and TC processes. We conducted both visual and quantitative performance evaluations on these images. The visual evaluation confirmed that the proposed method improves image clarity and contrast ratio compared with conventional HE-based methods, even in challenging driving scenarios such as tunnels. In the quantitative evaluation, the proposed method ranked in the upper-middle range for both image quality and processing speed metrics. Therefore, our proposed method proves effective for the essential contrast-enhancement process in LWIR-based uncooled thermal-imaging cameras intended for autonomous driving platforms.
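As a point of reference for the clipping idea above, the following is a minimal sketch of clip-limited histogram equalization for an 8-bit image; it is a generic illustration with an assumed global clip limit, not the paper's region-based clipping method.

```c
/* Minimal sketch of clip-limited histogram equalization for an 8-bit image.
 * Generic illustration only; the paper's region-based clipping is not reproduced. */
#include <stdint.h>
#include <stddef.h>

void clipped_he(uint8_t *img, size_t n, uint32_t clip_limit)
{
    uint32_t hist[256] = {0};
    for (size_t i = 0; i < n; i++)
        hist[img[i]]++;

    /* Clip each bin and collect the excess counts. */
    uint32_t excess = 0;
    for (int b = 0; b < 256; b++) {
        if (hist[b] > clip_limit) {
            excess += hist[b] - clip_limit;
            hist[b] = clip_limit;
        }
    }
    /* Redistribute the excess evenly to flatten the mapping. */
    uint32_t share = excess / 256;
    for (int b = 0; b < 256; b++)
        hist[b] += share;

    /* Build the cumulative distribution and remap pixels. */
    uint32_t cdf[256], cum = 0;
    for (int b = 0; b < 256; b++) {
        cum += hist[b];
        cdf[b] = cum;
    }
    for (size_t i = 0; i < n; i++)
        img[i] = (uint8_t)((cdf[img[i]] * 255ULL) / cdf[255]);
}
```

The clip limit bounds how strongly any single gray level can stretch the mapping, which is what keeps HE-based methods from over-amplifying noise in flat thermal scenes.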
This paper presents a real-time embedded thermal imaging system architecture for compact, energy-efficient, high-quality imaging that utilizes a heterogeneous system-on-chip and uncooled infrared focal plane arrays (IRFPAs). In contrast to previous systems that relied on separate devices for complex image processing, our system provides integrated image-processing support for robust sensor-to-surveillance operation. We organized the image-processing architecture into two algorithm stacks: a non-uniformity correction stack that mitigates the distinctive noise vulnerability of uncooled IRFPAs, and an image enhancement stack that includes contrast enhancement and frame-level temporal noise filters. We optimized the algorithms for domain-specific factors, including asymmetric multiprocessing (AMP), cache organization, single instruction multiple data (SIMD) instructions, and very long instruction word (VLIW) architectures. The implementation on a TI TDA3x SoC demonstrates that our system can process 640×480, 60 frames per second (FPS) video at a peak core load of 57.5% while consuming less than 2.2 W for the entire system, indicating the capability to process 1280×1024, 30 FPS video from state-of-the-art IRFPAs.
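As one plausible illustration of a frame-level temporal noise filter in the enhancement stack (the abstract does not specify the exact filter, so the recursive averaging form below is only an assumption), a fixed-point sketch:

```c
/* Hypothetical fixed-point recursive temporal filter:
 * prev = prev + (cur - prev) / 2^shift, i.e. an exponential moving average.
 * Illustrative only; the actual filter in the enhancement stack is not detailed. */
#include <stdint.h>
#include <stddef.h>

void temporal_filter(uint16_t *prev, const uint16_t *cur, size_t n, unsigned shift)
{
    for (size_t i = 0; i < n; i++) {
        int32_t diff = (int32_t)cur[i] - (int32_t)prev[i];
        prev[i] = (uint16_t)((int32_t)prev[i] + diff / (1 << shift));
    }
}
```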
Designing a high-performance hardware sorter for resource-constrained systems is challenging due to physical limitations and the need to balance streaming bandwidth with memory throughput. This paper introduces a novel, scalable hardware sorter architecture with fully streaming support and an accompanying RTL generator to provide versatile, energy-efficient hardware acceleration. Our solution employs a dual-layer architecture consisting of a parallel one-way linear insertion sorter (OLIS) for bandwidth optimization and a cyclic bitonic merge network (CBMN) for a compact, high-throughput implementation. Furthermore, we developed the RTL generator in Chisel to enable agile implementation of the scalable architecture. Experimental results targeting the Xilinx XVU37P-FSVH2892-2L-E FPGA show that our design achieves up to a 126.26% increase in throughput and a 68.46% decrease in latency, with an increase of no more than 132.94% in LUT usage and a reduction of up to 79.84% in flip-flop usage, compared to state-of-the-art streaming sorters.
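For readers unfamiliar with the merge step that the CBMN implements, below is a short software model of a bitonic merge of two ascending runs. The hardware presumably reuses such compare-exchange stages cyclically rather than unrolling them, so this is a behavioral reference only, not the RTL.

```c
/* Software model of a bitonic merge network: merges two ascending runs of
 * length n (n a power of two) held in buf[0..n-1] and buf[n..2n-1] into one
 * ascending run of length 2n. Behavioral reference only. */
#include <stdint.h>

static void compare_exchange(uint32_t *a, uint32_t *b)
{
    if (*a > *b) { uint32_t t = *a; *a = *b; *b = t; }
}

void bitonic_merge(uint32_t *buf, int n)   /* buf holds 2n elements */
{
    /* Reverse the second run so the whole buffer becomes a bitonic sequence. */
    for (int i = 0; i < n / 2; i++) {
        uint32_t t = buf[n + i];
        buf[n + i] = buf[2 * n - 1 - i];
        buf[2 * n - 1 - i] = t;
    }
    /* log2(2n) compare-exchange stages with halving stride. */
    for (int stride = n; stride >= 1; stride /= 2)
        for (int i = 0; i < 2 * n; i++)
            if ((i & stride) == 0)
                compare_exchange(&buf[i], &buf[i + stride]);
}
```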
Embedded stereo vision systems based on traditional approaches often require a disparity refinement process to enhance image quality. Weighted median filter (WMF)-based processors are commonly employed for their excellent refinement performance. However, when implemented on a field-programmable gate array (FPGA), WMF-based processors face a trade-off between hardware resource utilization and refinement performance. To address this trade-off, we previously proposed a disparity refinement processor based on the hybrid max-median filter (HMMF). However, our earlier work did not guarantee flawless operation in large occluded and texture-less regions, particularly in areas with numerous holes. To overcome this limitation, we propose a cell-based disparity refinement processor that extends our previous HMMF-based design. To evaluate its refinement performance, we conducted experiments on four publicly available stereo datasets. The proposed processor outperforms conventional processors on the KITTI 2012 and 2015 stereo benchmark datasets, and it also exhibits superior refinement performance on the Cityscapes and StereoDriving datasets. Furthermore, in terms of hardware resource utilization, the proposed processor requires fewer resources than conventional processors when implemented on an FPGA. Therefore, our proposed disparity refinement processor is well suited for the disparity refinement process in stereo vision systems that require cost-effectiveness and high performance.
This paper presents an integrated image processor architecture designed for real-time interfacing and processing of high-resolution thermal video obtained from an uncooled infrared focal plane array (IRFPA), utilizing a modern system-on-chip field-programmable gate array (SoC FPGA). Our processor provides a one-chip solution in which non-uniformity correction (NUC) algorithms and contrast enhancement methods (CEM) are performed seamlessly. We employed NUC algorithms that utilize multiple coefficients to ensure robust image quality, free from ghosting effects and blurring. These algorithms include polynomial modeling-based thermal drift compensation (TDC), two-point correction (TPC), and run-time discrete flat field correction (FFC). To address the memory bottlenecks originating from the parallel execution of NUC algorithms in real time, we designed accelerators and parallel caching modules for pixel-wise algorithms based on a multi-parameter polynomial expression. Furthermore, we designed a specialized accelerator architecture to minimize the interruption time caused by run-time FFC. The implementation on the XC7Z020CLG400 SoC FPGA with the QuantumRed VR thermal module demonstrates that our image processing module achieves a throughput of 60 frames per second (FPS) when processing 14-bit 640×480 resolution infrared video acquired from an uncooled IRFPA.
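To make the pixel-wise NUC step concrete, here is a minimal software sketch of two-point correction; the fixed-point coefficient format and clamping range are illustrative assumptions, not the paper's implementation details.

```c
/* Minimal sketch of pixel-wise two-point correction (TPC): each raw pixel is
 * scaled by a per-pixel gain and shifted by a per-pixel offset, both computed
 * offline from two uniform reference scenes. The Q12 gain format and 14-bit
 * clamp are assumptions for illustration. */
#include <stdint.h>
#include <stddef.h>

#define GAIN_Q 12   /* assumed Q12 fixed-point gain */

void two_point_correction(const uint16_t *raw, uint16_t *out,
                          const uint16_t *gain, const int16_t *offset, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t v = ((int32_t)raw[i] * gain[i]) >> GAIN_Q;
        v += offset[i];
        if (v < 0) v = 0;
        if (v > 0x3FFF) v = 0x3FFF;   /* clamp to the 14-bit pixel range */
        out[i] = (uint16_t)v;
    }
}
```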
In embedded stereo vision systems based on semi-global matching, the matching accuracy of the initial disparity map can be degraded by various factors. To solve this problem, weighted median-based disparity refinement hardware architectures are used to improve the matching accuracy. However, conventional hardware architectures face a trade-off between hardware resource utilization and refinement performance when implemented on a field-programmable gate array (FPGA). Therefore, in this paper, we propose a hybrid max-median filter and its hardware architecture to improve refinement performance while reducing hardware resource utilization. To evaluate the refinement performance, we used two public stereo datasets. Across various window sizes on the KITTI 2012 and 2015 stereo benchmark datasets, the proposed hardware architecture showed better matching accuracy than the conventional hardware architectures. In terms of hardware resource utilization, when implemented on an FPGA, the proposed architecture has low requirements for all types of hardware resources. That is, the proposed hardware architecture overcomes the trade-off between hardware resource utilization and refinement performance.
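For context, the conventional baseline that this work and the cell-based extension above compare against is a window median over the disparity map; a minimal software model is sketched below. The hybrid max-median filter itself is not detailed in these abstracts, so only the baseline is shown, with an assumed 3×3 window.

```c
/* Software model of a plain window median filter over an 8-bit disparity map,
 * the conventional refinement baseline. Illustrative only. */
#include <stdint.h>
#include <string.h>

#define WIN 3   /* assumed 3x3 window */

static uint8_t window_median(uint8_t *v, int n)
{
    /* Insertion-sort the small window, then pick the middle element. */
    for (int i = 1; i < n; i++) {
        uint8_t key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; j--; }
        v[j + 1] = key;
    }
    return v[n / 2];
}

void median_refine(const uint8_t *disp, uint8_t *out, int w, int h)
{
    memcpy(out, disp, (size_t)w * h);   /* border pixels pass through */
    for (int y = WIN / 2; y < h - WIN / 2; y++)
        for (int x = WIN / 2; x < w - WIN / 2; x++) {
            uint8_t win[WIN * WIN];
            int n = 0;
            for (int dy = -WIN / 2; dy <= WIN / 2; dy++)
                for (int dx = -WIN / 2; dx <= WIN / 2; dx++)
                    win[n++] = disp[(y + dy) * w + (x + dx)];
            out[y * w + x] = window_median(win, n);
        }
}
```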
This paper presents a lightweight processor and evaluation platform for migrating from IEEE-754 to posit arithmetic, with an optimized posit arithmetic unit (PAU) supporting existing floating-point instructions. The PAU features a reconfigurable divider architecture for diverse operating conditions and lightweight square-root logic. The platform includes a posit-optimized compiler, a divider generator, a JTAG environment builder, and a programmable logic controller. The experimental results demonstrate the successful execution of legacy IEEE-754 code with a small additional workload and up to a 60.09-fold performance improvement through hardware acceleration. Additionally, the PAU and divider consume 11.00% and 57.87% fewer LUTs, respectively, compared to the best prior works.
Edge computing is becoming increasingly popular in artificial intelligence (AI) application development due to the benefits of local execution. One widely used approach to overcoming hardware limitations in edge computing is heterogeneous computing, which combines a general-purpose processor (GPP) with a domain-specific AI processor. However, this approach can be inefficient due to the communication overhead resulting from the complex communication protocol. To avoid this overhead, the concept of an application-specific instruction set processor based on a customizable instruction set architecture (ISA) has emerged. By integrating the AI processor into the processor core, on-chip communication replaces the complex communication protocol. Furthermore, a custom instruction set extension (ISE) reduces the number of instructions needed to execute AI applications. In this paper, we propose a uniprocessor system architecture for lightweight AI systems. First, we define a custom ISE to integrate the AI processor and the GPP into a single processor, minimizing communication overhead. Next, we design the processor based on the integrated core architecture, comprising the base core and the AI core, and implement it on an FPGA. Finally, we evaluate the proposed architecture through simulation and implementation of the processor. The results show that the designed processor consumes 6.62% more lookup tables and 74% fewer flip-flops while achieving up to 193.88 times higher throughput and 52.75 times higher energy efficiency compared to the previous system.
Recent advances in semiconductor technology have led ongoing applications to adopt complex techniques based on neural networks. In line with this trend, the concept of optimizing real-number arithmetic has been raised. In this paper, we evaluate the performance of the novel number system named posit on neural networks by analyzing the execution of approximate exponential functions, which are fundamental to several activation functions, with posit32 and float32. To implement the functions with posit arithmetic, we designed a software posit library consisting of basic arithmetic operations and conversion operations from/to C standard data types. The results show that posit arithmetic reduces the average relative error rate by up to 87.12% on the exponential function.
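To illustrate the kind of kernel evaluated, below is a minimal truncated-Taylor-series exponential in float32; in the paper the same computation is also carried out through the software posit library in posit32 (that interface is assumed and not reproduced here).

```c
/* Minimal approximate exponential via a truncated Taylor series, the kind of
 * kernel underlying several activation functions. Shown in float32 only. */
#include <stdio.h>

static float approx_exp(float x, int terms)
{
    float sum = 1.0f, term = 1.0f;
    for (int k = 1; k < terms; k++) {
        term *= x / (float)k;   /* accumulate x^k / k! incrementally */
        sum += term;
    }
    return sum;
}

int main(void)
{
    printf("approx_exp(1.0) = %f\n", approx_exp(1.0f, 10));  /* ~2.718282 */
    return 0;
}
```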
Recently, studies have been conducted on intelligent transportation systems (ITS) that provide safety and convenience to humans. Systems composing an ITS typically adopt cloud-computing architectures built on high-performance general-purpose processors or graphics processing units. However, an architecture that relies only on cloud computing requires high network bandwidth and consumes considerable power. Therefore, applying edge computing to ITS is essential for solving these problems. In this paper, we propose an edge artificial intelligence (AI) device-based ITS. Edge AI, which is applicable to various systems in ITS, has been applied to license plate recognition. We implemented the edge AI on a field-programmable gate array (FPGA). The accuracy of the edge AI for license plate recognition was 0.94. Finally, we synthesized the edge AI logic with Magnachip/Hynix 180 nm CMOS technology, and the power consumption measured with the Synopsys Design Compiler tool was 482.583 mW.
An intelligent transportation system (ITS) is a future system that combines various technologies to provide safety and convenience to humans. To implement ITS, previous systems applied an architecture containing a large number of data centers with high-performance general-purpose processors and graphics processing units to collect vehicle information. However, this architecture not only requires high network bandwidth but also decreases power efficiency and weakens security. In this paper, we propose an ITS based on an edge AI device that solves the problems of the existing architecture. We applied the edge AI device, which is applicable to various systems in ITS, to license plate recognition, and the highest accuracy was 0.94. We implemented the edge AI device on a field-programmable gate array (FPGA) and verified the feasibility of the entire system with the proposed edge AI device.
As the amount of data in automotive systems increases, a dedicated communication controller for in-vehicle networks is required. This paper proposes a local interconnect network (LIN) controller for resource-constrained devices. The designed LIN controller efficiently reduces the workload of target devices by processing the LIN frame header, data response, and protocol errors. To demonstrate the feasibility of the design, a Cortex-M0 is employed as the main processor and connected to the LIN controller. We implemented a LIN node by programming the processor, and the functionality of the LIN controller was verified with a LIN frame analyzer and a hardware scope. In addition, we analyzed the effect of communication loads on the processor and evaluated the benefits of the LIN controller.
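Two of the frame-processing steps such a controller offloads from the host can be sketched as software reference models: protected-identifier (PID) parity generation and the classic checksum over the data response, both per the LIN 2.x specification. Whether the designed controller computes them exactly this way is not stated, so this is only an illustration.

```c
/* Software reference for two LIN frame-processing steps (LIN 2.x):
 * protected-identifier parity and the classic data checksum. */
#include <stdint.h>
#include <stddef.h>

/* Build the PID from a 6-bit frame ID:
 * P0 = ID0^ID1^ID2^ID4, P1 = !(ID1^ID3^ID4^ID5). */
uint8_t lin_pid(uint8_t id)
{
    uint8_t p0 = ((id >> 0) ^ (id >> 1) ^ (id >> 2) ^ (id >> 4)) & 1u;
    uint8_t p1 = (~((id >> 1) ^ (id >> 3) ^ (id >> 4) ^ (id >> 5))) & 1u;
    return (uint8_t)((id & 0x3Fu) | (p0 << 6) | (p1 << 7));
}

/* Classic checksum: inverted 8-bit sum with carry wrap-around over the data bytes. */
uint8_t lin_classic_checksum(const uint8_t *data, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];
        if (sum > 0xFF)
            sum -= 0xFF;        /* fold the carry back into the low byte */
    }
    return (uint8_t)(~sum);
}
```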
Recent advances in artificial intelligence (AI) technology encourage the adoption of AI systems for various applications. In most deployments, AI-based computing systems adopt an architecture in which the central server processes most of the data. This characteristic makes the system consume a large amount of network bandwidth and can cause security issues. To overcome these issues, a new AI model called federated learning was presented. Federated learning adopts an architecture in which clients handle data training and transmit only the trained result to the central server. Because training on the client abstracts and reduces the original data, the system operates with reduced network resources and reinforced data security. A system with federated learning supports a variety of client systems. To build an AI system with resource-limited client systems, composing the client system with multiple embedded AI processors is a valid approach. To realize a system with this architecture, a controller that arbitrates and utilizes the AI processors becomes essential. In this paper, we propose an embedded AI system for federated learning that can be composed flexibly with AI cores depending on the application. To realize the proposed system, we designed a controller for multiple AI cores and implemented it on a field-programmable gate array (FPGA). The operation of the designed controller was verified through image and speech applications, and its performance was verified through a simulator.
Artificial intelligence algorithms need an external computing device such as a graphics processing unit (GPU) due to their computational complexity. To run artificial intelligence algorithms on an embedded device, many studies have proposed lightweight artificial intelligence algorithms and artificial intelligence accelerators. In this paper, we propose the ASimOV framework, which optimizes artificial intelligence algorithms and generates Verilog hardware description language (HDL) code for executing them on a field-programmable gate array (FPGA). To verify ASimOV, we explore the performance space of k-NN algorithms and generate Verilog HDL code to demonstrate a k-NN accelerator on an FPGA. Our contribution is to provide the artificial intelligence algorithm as an end-to-end pipeline, ensure that it is optimized for a specific dataset through simulation, and generate an artificial intelligence accelerator at the end.
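As an illustration of the kernel that the generated accelerator targets, a minimal k-NN classification routine is sketched below; the feature width, L1 distance metric, and label count are assumptions for the example, not parameters reported in the abstract.

```c
/* Minimal k-NN classification kernel: find the k nearest training vectors by
 * L1 distance and take a majority vote. Sizes are illustrative assumptions. */
#include <stdint.h>
#include <limits.h>

#define FEAT 16      /* assumed feature vector length */
#define CLASSES 10   /* assumed number of labels; labels[] values must be < CLASSES */

int knn_classify(const uint8_t train[][FEAT], const uint8_t *labels, int n_train,
                 const uint8_t *query, int k)
{
    int votes[CLASSES] = {0};
    uint8_t used[1024] = {0};            /* assumes n_train <= 1024 */

    for (int nearest = 0; nearest < k; nearest++) {
        int best = -1;
        unsigned best_dist = UINT_MAX;
        for (int i = 0; i < n_train; i++) {
            if (used[i]) continue;
            unsigned dist = 0;
            for (int f = 0; f < FEAT; f++) {
                int d = (int)train[i][f] - (int)query[f];
                dist += (unsigned)(d < 0 ? -d : d);   /* L1 distance */
            }
            if (dist < best_dist) { best_dist = dist; best = i; }
        }
        if (best < 0) break;             /* fewer than k training samples */
        used[best] = 1;
        votes[labels[best]]++;
    }

    int best_class = 0;
    for (int c = 1; c < CLASSES; c++)
        if (votes[c] > votes[best_class]) best_class = c;
    return best_class;
}
```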
Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing under tight design constraints such as low power consumption and a small area. To satisfy such conditions, including a dedicated 2D graphics accelerator in the embedded system is an effective method. This approach reduces the workload of the processor in the embedded system by exploiting the accelerator, which assists the system in performing 2D graphics processing in real time. Therefore, a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham's algorithm. The optimized operation enables the accelerator to handle various kinds of 2D graphics processing and to perform line drawing instead of the system processor. Moreover, the accelerator distributes the workload of the processor core by removing the need for the core to access the frame buffer memory. We measured the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and ascertained the feasibility of realization by synthesizing it with a 180 nm CMOS process.
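As a software reference for the accelerated operation, a standard integer-only Bresenham line-drawing routine is sketched below; the frame-buffer layout and pixel interface are illustrative assumptions rather than the accelerator's actual register map.

```c
/* Integer-only Bresenham line drawing (all-octant variant); the accelerator
 * performs this step instead of the core. Frame-buffer layout is assumed. */
#include <stdint.h>
#include <stdlib.h>

#define FB_WIDTH 320   /* assumed frame-buffer width */

static void put_pixel(uint8_t *fb, int x, int y, uint8_t color)
{
    fb[y * FB_WIDTH + x] = color;
}

void draw_line(uint8_t *fb, int x0, int y0, int x1, int y1, uint8_t color)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;                       /* error term, no divides or floats */

    for (;;) {
        put_pixel(fb, x0, y0, color);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}
```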
The development of the mobile industry brings demand for high-performance embedded systems to meet the requirements of user-centered applications. Because of limited memory resources, employing compressed data is efficient for an embedded system. However, the workload of data decompression imposes a severe bottleneck on the embedded processor. One way to alleviate this bottleneck is to integrate a hardware accelerator alongside the processor, constructing a system-on-chip (SoC) for the embedded system. In this paper, we propose a lossless decompression accelerator for an embedded processor, which supports LZ77 decompression and static Huffman decoding for the inflate algorithm. The accelerator was implemented on a field-programmable gate array (FPGA) to verify its functional suitability and fabricated in a Samsung 65 nm complementary metal-oxide-semiconductor (CMOS) process. The performance of the accelerator was evaluated with the Canterbury corpus benchmark, achieving a throughput of up to 20.7 MB/s at a 50 MHz system clock frequency.
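The core of inflate after Huffman decoding is LZ77 back-reference expansion, sketched below as a software reference; the static Huffman decoder that produces the (length, distance) pairs is omitted, and the byte-level interface is an assumption for illustration.

```c
/* LZ77 back-reference expansion at the heart of inflate: each (length, distance)
 * pair copies bytes from the already-decompressed output. */
#include <stddef.h>
#include <stdint.h>

size_t lz77_copy(uint8_t *out, size_t out_pos, unsigned length, unsigned distance)
{
    /* Byte-by-byte copy so overlapping references (distance < length) repeat
     * the most recent bytes, as the DEFLATE format requires. */
    for (unsigned i = 0; i < length; i++) {
        out[out_pos] = out[out_pos - distance];
        out_pos++;
    }
    return out_pos;
}
```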
Recently, as interest in smart parking systems has increased, various methods for detecting parking occupancy are under study. In this paper, we present a vision-based parking occupancy detection system with an embedded AI processor. By employing a fisheye-lens camera, multiple parking slot states are identified by one device. We measured the recognition rate of the AI processor in the proposed system and determined the optimized configuration with a software simulator. The highest recognition rate, 94.48%, was measured in the configuration with 64 training samples of 256 bytes each.
In this paper, we propose a 32-bit processor for embedded systems. To provide a small area and low-power operation, we adopt the MIPS instruction set architecture (ISA) for our processor. The processor consists of five pipeline stages to reduce the critical path. To resolve data hazards between pipeline stages, we design a data forwarding unit and a stall unit with optimized bubble insertion. The processor is implemented on a field-programmable gate array (FPGA), and we verify its functionality and measure its performance using the Dhrystone benchmark. The Dhrystone MIPS (DMIPS) score is 27.71 at 50 MHz.
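For readers unfamiliar with the hazard-resolution logic mentioned above, the textbook EX-stage forwarding decision can be modeled in software as follows; the signal names are illustrative, and the processor's actual RTL is not reproduced here.

```c
/* Software model of the EX-stage forwarding decision in a classic five-stage
 * pipeline: a source operand is taken from the EX/MEM or MEM/WB result when
 * that stage is about to write the same register. Register $0 is hardwired to
 * zero in MIPS, hence the rd != 0 checks. */
#include <stdint.h>

typedef enum { FWD_NONE = 0, FWD_EX_MEM = 1, FWD_MEM_WB = 2 } fwd_t;

fwd_t forward_select(uint8_t id_ex_rs,
                     uint8_t ex_mem_rd, int ex_mem_regwrite,
                     uint8_t mem_wb_rd, int mem_wb_regwrite)
{
    if (ex_mem_regwrite && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs)
        return FWD_EX_MEM;          /* the newest result takes priority */
    if (mem_wb_regwrite && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs)
        return FWD_MEM_WB;
    return FWD_NONE;
}
```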