I earned my M.S. degree in Electronic Engineering from Seoul National University of Science and Technology in 2023, where I was advised by Prof. Seung Eun Lee.
During my Master’s, I focused on two main research topics:
Designing flexible architectures to incorporate a novel standard for real-number arithmetic, specifically “posit”, into general-purpose processors.
Integrating domain-specific hardware for parallel processing into conventional general-purpose processors.
Most of my work was implemented using FPGAs, and some projects were fabricated into ASICs.
I am now a Junior Engineer at Hanwha Systems, a leading Korean defense electronics company. In this role, I am responsible for designing SoC FPGA-based image processors and developing an RTOS for heterogeneous MPSoCs. Both of these tasks are related to infrared image processing.
This paper presents a real-time embedded thermal imaging system architecture for compact, energy-efficient, high-quality imaging utilizing a heterogeneous system-on-chip and uncooled infrared focal plane arrays (IRFPAs). In contrast to previous systems that organized separate devices for complex image processing, our system provides integrated image processing support for robust sensor-to-surveillance. We organized the image processing architecture into two algorithm stacks: a non-uniformity correction stack to mitigate the distinctive noise vulnerability of uncooled IRFPAs and an image enhancement stack, which includes contrast enhancement and frame-level temporal noise filters. We optimized the algorithms for domain-specific factors, including asymmetric multiprocessing (AMP), cache organization, single instruction multiple data (SIMD) instructions, and very long instruction word (VLIW) architectures. The implementation on the TI TDA3x SoC demonstrates that our system can process 640×480, 60 frames per second (FPS) video at a peak core load of 57.5% while consuming less than 2.2 W for the entire system, suggesting that it could also process the 1280×1024, 30 FPS video produced by state-of-the-art IRFPAs.
Designing a high-performance hardware sorter for resource-constrained systems is challenging due to physical limitations and the need to balance streaming bandwidth with memory throughput. This paper introduces a novel, scalable hardware sorter architecture with fully-streaming support and an accompanying RTL generator to provide versatile, energy-efficient hardware acceleration. Our solution employs a dual-layer architecture consisting of a parallel one-way linear insertion sorter (OLIS) for bandwidth optimization and a cyclic bitonic merge network (CBMN) for a compact, high-throughput implementation. Furthermore, we developed the RTL generator in Chisel to enable agile implementation of the scalable architecture. Experimental results targeting the Xilinx XVU37P-FSVH2892-2L-E FPGA show that our design achieves up to a 126.26% increase in throughput and a 68.46% decrease in latency, with an area increment of no more than 132.94% for LUTs and a decrement of up to 79.84% for flip-flops, compared to the state-of-the-art streaming sorter.
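For readers unfamiliar with bitonic merging, the idea behind the merge layer can be sketched in software. This is a minimal illustration of a plain (non-cyclic) bitonic merge, assuming two ascending runs of equal power-of-two length; the function names are hypothetical, and it does not reproduce the CBMN or OLIS hardware described in the paper.

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    lo, hi = list(seq[:half]), list(seq[half:])
    # One compare-exchange stage: element i is compared with element i + half.
    for i in range(half):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    # Each half is again bitonic, so recurse on both.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def merge_sorted_runs(a, b):
    """Merge two ascending runs of equal power-of-two length (an assumption)."""
    # Reversing the second run makes the concatenation bitonic.
    return bitonic_merge(list(a) + list(b)[::-1])


if __name__ == "__main__":
    print(merge_sorted_runs([1, 4, 6, 7], [2, 3, 5, 8]))
```

In hardware, each compare-exchange stage maps to a row of comparators, which is why the network's depth grows only logarithmically with the number of inputs.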
Embedded stereo vision systems based on traditional approaches often require a disparity refinement process to enhance image quality. Weighted median filter (WMF)-based processors are commonly employed for their excellent refinement performance. However, when implemented on a field-programmable gate array (FPGA), WMF-based processors face a trade-off between hardware resource utilization and refinement performance. To address this trade-off, we previously proposed a new disparity refinement processor based on the hybrid max-median filter (HMMF). However, our earlier work did not guarantee flawless operation in large occluded and texture-less regions, particularly in areas with numerous holes. In order to overcome this limitation of conventional processors, we proposed a cell-based disparity refinement processor. This processor extends our previous HMMF-based disparity refinement processor. To evaluate its refinement performance, we conducted experiments using four types of publicly available stereo datasets. When comparing refinement performance, our proposed processor outperforms conventional processors when using the KITTI 2012 and 2015 stereo benchmark datasets. Additionally, the results demonstrate that our proposed processor exhibits superior refinement performance when applied to the Cityscapes and StereoDriving datasets in comparison to conventional processors. Furthermore, when considering hardware resource utilization, our proposed processor demonstrates lower resource requirements than conventional processors when implemented on an FPGA. Therefore, our proposed disparity refinement processor is well-suited for the disparity refinement process in stereo vision systems that require cost-effectiveness and high performance.
This paper presents an integrated image processor architecture designed for real-time interfacing and processing of high-resolution thermal video obtained from an uncooled infrared focal plane array (IRFPA) utilizing a modern system-on-chip field-programmable gate array (SoC FPGA). Our processor provides a one-chip solution that seamlessly performs non-uniformity correction (NUC) algorithms and contrast enhancement methods (CEM). We have employed NUC algorithms that utilize multiple coefficients to ensure robust image quality, free from ghosting effects and blurring. These algorithms include polynomial modeling-based thermal drift compensation (TDC), two-point correction (TPC), and run-time discrete flat field correction (FFC). To address the memory bottlenecks originating from the parallel execution of NUC algorithms in real-time, we designed accelerators and parallel caching modules for pixel-wise algorithms based on a multi-parameter polynomial expression. Furthermore, we designed a specialized accelerator architecture to minimize the interrupted time for run-time FFC. The implementation on the XC7Z020CLG400 SoC FPGA with the QuantumRed VR thermal module demonstrates that our image processing module achieves a throughput of 60 frames per second (FPS) when processing 14-bit 640×480 resolution infrared video acquired from an uncooled IRFPA.
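As a rough illustration of what two-point correction does, the sketch below computes a per-pixel gain and offset from two uniform-temperature reference frames and applies them to a raw frame. It is a textbook formulation under stated assumptions (flat pixel lists, no dead pixels, hypothetical function names and sample values), not the accelerator design or coefficient format used in the paper.

```python
def two_point_coefficients(cold, hot):
    """Per-pixel gain and offset from two uniform reference frames."""
    n = len(cold)
    mean_cold = sum(cold) / n
    mean_hot = sum(hot) / n
    gains, offsets = [], []
    for c, h in zip(cold, hot):
        # Assumes h != c for every pixel (no dead or saturated pixels).
        g = (mean_hot - mean_cold) / (h - c)
        gains.append(g)
        offsets.append(mean_cold - g * c)
    return gains, offsets


def apply_two_point(raw, gains, offsets):
    """Correct a raw frame pixel by pixel: corrected = gain * raw + offset."""
    return [g * r + o for r, g, o in zip(raw, gains, offsets)]


if __name__ == "__main__":
    cold = [100, 110, 90, 100]   # hypothetical raw counts at the low reference
    hot = [200, 230, 170, 200]   # hypothetical raw counts at the high reference
    gains, offsets = two_point_coefficients(cold, hot)
    print(apply_two_point(cold, gains, offsets))
```

After correction, both reference frames map to their frame means, so pixel-to-pixel response differences are removed at the two calibration points.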
This paper presents a lightweight processor and evaluation platform for migrating from IEEE-754 to posit arithmetic, with an optimized posit arithmetic unit (PAU) supporting existing floating-point instructions. The PAU features a reconfigurable divider architecture for diverse operating conditions and lightweight square root logic. The platform includes a posit-optimized compiler, divider generator, JTAG environment builder, and programmable logic controller. The experimental results demonstrate the successful execution of legacy IEEE-754 code with a small additional workload and up to a 60.09-fold performance improvement through hardware acceleration. Additionally, the PAU and divider consume 11.00% and 57.87% fewer LUTs, respectively, compared to the best prior works.
Edge computing is becoming increasingly popular in artificial intelligence (AI) application development due to the benefits of local execution. One widely used approach to overcome hardware limitations in edge computing is heterogeneous computing, which combines a general-purpose processor (GPP) with a domain-specific AI processor. However, this approach can be inefficient due to the communication overhead resulting from the complex communication protocol. To avoid communication overhead, the concept of an application-specific instruction set processor based on a customizable instruction set architecture (ISA) has emerged. Integrating the AI processor into the processor core replaces the complex communication protocol with on-chip communication. Further, a custom instruction set extension (ISE) reduces the number of instructions needed to execute AI applications. In this paper, we propose a uniprocessor system architecture for lightweight AI systems. First, we define the custom ISE to integrate the AI processor and GPP into a single processor, minimizing communication overhead. Next, we design the processor based on the integrated core architecture, including the base core and the AI core, and implement the processor on an FPGA. Finally, we evaluate the proposed architecture through simulation and implementation of the processor. The results show that the designed processor consumes 6.62% more lookup tables and 74% fewer flip-flops while achieving up to 193.88 times higher throughput and 52.75 times the energy efficiency compared to the previous system.
Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing under tight design constraints such as low power consumption and a small area. To satisfy such conditions, including a dedicated 2D graphics accelerator in the embedded system is an effective method. This method reduces the workload of the processor in the embedded system by exploiting the accelerator. The accelerator assists the system in performing 2D graphics processing in real-time. Therefore, a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham’s algorithm. The optimized operation enables the accelerator to deal with various kinds of 2D graphics processing and to perform the line-drawing instead of the system processor. Moreover, the accelerator also reduces the workload of the processor core by removing the need for the core to access the frame buffer memory. We measure the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and confirm its feasibility for fabrication by synthesizing it in a 180 nm CMOS process.
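For reference, Bresenham’s algorithm selects the pixels of a line using only integer additions and comparisons, which is what makes it attractive for small hardware. The following is a minimal software sketch of the standard all-octant form; the function name is hypothetical, and it does not reflect the optimized operation implemented in the accelerator.

```python
def bresenham_line(x0, y0, x1, y1):
    """Return the integer pixel coordinates on the line from (x0, y0) to (x1, y1)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1   # step direction along x
    sy = 1 if y0 < y1 else -1   # step direction along y
    err = dx + dy               # running error term, integers only
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:            # error favors a step along x
            err += dy
            x0 += sx
        if e2 <= dx:            # error favors a step along y
            err += dx
            y0 += sy
    return points


if __name__ == "__main__":
    print(bresenham_line(0, 0, 5, 2))
```

Because the loop needs no multiplication or division, each iteration maps naturally to a few adders and comparators.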