publications
List of peer-reviewed publications in reverse chronological order.
- Types: Int'l Conference, Int'l Journal
2026
- T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization
  Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, and Mohsen Imani
  Design, Automation and Test in Europe Conference (DATE), Verona, Italy, Apr 2026, pp. 1–7
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
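Ternary quantization restricts weights to {-1, 0, +1}, so every multiply in a matrix product collapses to an add, a subtract, or a skip. A minimal NumPy sketch of that arithmetic (illustrative only; it is not T-SAR's in-register LUT mechanism):

```python
import numpy as np

def ternary_matvec(W, x):
    """y = W @ x for ternary W in {-1, 0, +1}: each multiply
    becomes an addition, a subtraction, or a skip."""
    pos = (W == 1).astype(x.dtype)   # +1 weights contribute +x[j]
    neg = (W == -1).astype(x.dtype)  # -1 weights contribute -x[j]
    return pos @ x - neg @ x

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # random ternary weight matrix
x = rng.standard_normal(8)             # activations
assert np.allclose(ternary_matvec(W, x), W @ x)
```

The multiplier-free form is what makes CPU SIMD units attractive for ternary inference; the paper's contribution is generating the needed lookup tables inside the SIMD register file rather than in memory.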
@inproceedings{oh_tsar_2026, address = {Verona, Italy}, title = {{T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization}}, booktitle = {{Design, Automation and Test in Europe Conference (DATE)}}, author = {Oh, Hyunwoo and Nam, KyungIn and Bhattacharjya, Rajat and Chen, Hanning and Das, Tamoghno and Yun, Sanggeon and Jang, Suyeon and Ding, Andrew and Dutt, Nikil and Imani, Mohsen}, month = apr, year = {2026}, pages = {1--7}, }
- QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention
  Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang, Tamoghno Das, Suyeon Jang, and Mohsen Imani
  Design, Automation and Test in Europe Conference (DATE), Verona, Italy, Apr 2026, pp. 1–7
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, softmax, aggregation, and the final projection (W′m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented in RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy stays within <=0.9 AP of FP32 across Deformable and Sparse DETR variants. By converting sparsity into locality, and locality into utilization, QUILL delivers consistent end-to-end speedups.
@inproceedings{oh_quill_2026, address = {Verona, Italy}, title = {{QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention}}, booktitle = {{Design, Automation and Test in Europe Conference (DATE)}}, author = {Oh, Hyunwoo and Chen, Hanning and Yun, Sanggeon and Ni, Yang and Huang, Wenjun and Das, Tamoghno and Jang, Suyeon and Imani, Mohsen}, month = apr, year = {2026}, pages = {1--7}, }
- RIFT: A Single-Bitstream, Runtime-Adaptive FPGA-Based Accelerator for Multimodal AI
  Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, and Mohsen Imani
  Design, Automation and Test in Europe Conference (DATE), Verona, Italy, Apr 2026, pp. 1–7
Multimodal models spanning ViTs, CNNs, GNNs, and transformer NLP stress embedded systems because their heterogeneous compute and memory behaviors complicate resource allocation, load balancing, and real-time inference. We present RIFT, a single-bitstream FPGA accelerator and compiler for end-to-end multimodal inference. RIFT unifies layers as DDMM/SDDMM/SpMM kernels executed on a runtime mode-switchable engine that morphs among weight-/output-stationary systolic, 1×CS SIMD, and a routable adder tree (RADT) on a shared datapath. A two-stage hardware top-k unit, width-matched to the array, performs in-stream token pruning with minimal buffering, and dependency-aware layer offloading (DALO) overlaps independent kernels across multiple RPUs, achieving adaptation without bitstream reconfiguration. On Alveo U50 and ZCU104, RIFT reduces latency by up to 22.57× versus an RTX 4090 and 6.86× versus a Jetson Orin Nano at ∼20–21 W; pruning alone yields up to 7.8× on ViT-heavy workloads. Ablations isolate contributions, with DALO improving throughput by up to 79%. Compared to prior FPGA designs, RIFT delivers state-of-the-art latency and energy efficiency across vision, language, and graph workloads in a single bitstream.
@inproceedings{oh_rift_2026, address = {Verona, Italy}, title = {{RIFT: A Single-Bitstream, Runtime-Adaptive FPGA-Based Accelerator for Multimodal AI}}, booktitle = {{Design, Automation and Test in Europe Conference (DATE)}}, author = {Oh, Hyunwoo and Nam, KyungIn and Bhattacharjya, Rajat and Chen, Hanning and Das, Tamoghno and Yun, Sanggeon and Jang, Suyeon and Ding, Andrew and Dutt, Nikil and Imani, Mohsen}, month = apr, year = {2026}, pages = {1--7}, }
- DecoHD: Decomposed Hyperdimensional Classification under Extreme Memory Budgets
  Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, and Mohsen Imani
  Design, Automation and Test in Europe Conference (DATE), Verona, Italy, Apr 2026, pp. 1–7
Decomposition is a proven way to shrink deep networks without changing I/O. We bring this idea to hyperdimensional computing (HDC), where footprint cuts usually shrink the feature axis and erode concentration and robustness. Prior HDC decompositions decode via fixed atomic hypervectors, which are ill-suited for compressing learned class prototypes. We introduce DecoHD, which learns directly in a decomposed HDC parameterization: a small, shared set of per-layer channels with multiplicative binding across layers and bundling at the end, yielding a large representational space from compact factors. DecoHD compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score; training is end-to-end, and inference remains pure HDC, aligning with in/near-memory accelerators. In evaluation, DecoHD attains extreme memory savings with only minor accuracy degradation under tight deployment budgets. On average it stays within about 0.1-0.15% of a strong non-reduced HDC baseline (worst case 5.7%), is more robust to random bit-flip noise, reaches its accuracy plateau with up to 97% fewer trainable parameters, and, in hardware, delivers roughly 277x/35x energy/speed gains over a CPU (AMD Ryzen 9 9950X), 13.5x/3.7x over a GPU (NVIDIA RTX 4090), and 2.0x/2.4x over a baseline HDC ASIC.
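For readers unfamiliar with the bind-bundle-score pipeline the abstract says is preserved, a toy bipolar HDC classifier shows the three operations (a generic sketch with made-up dimensions and random data, not DecoHD's decomposed parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, L = 2048, 16, 8                            # HV dimension, features, value levels

feature_hvs = rng.choice([-1, 1], size=(F, D))   # random bipolar feature-identity HVs
level_hvs   = rng.choice([-1, 1], size=(L, D))   # quantized value-level HVs

def encode(levels):
    """Bind each feature HV with its value-level HV, then bundle by majority sign."""
    bound = feature_hvs * level_hvs[levels]      # bind: elementwise multiply, (F, D)
    return np.sign(bound.sum(axis=0))            # bundle: majority vote per dimension

# one prototype per class, built by bundling that class's encoded samples
samples = rng.integers(0, L, size=(2, 5, F))     # 2 classes x 5 samples x F features
prototypes = np.array([np.sign(sum(encode(s) for s in cls)) for cls in samples])

def classify(levels):
    return int(np.argmax(prototypes @ encode(levels)))   # score: dot similarity

assert classify(samples[0][0]) == 0              # training samples land on their class
```

DecoHD's point is that the `prototypes` table, which costs one D-dimensional vector per class, is the part that gets factored into compact shared components.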
@inproceedings{yun_decohd_2026, address = {Verona, Italy}, title = {{DecoHD: Decomposed Hyperdimensional Classification under Extreme Memory Budgets}}, booktitle = {{Design, Automation and Test in Europe Conference (DATE)}}, author = {Yun, Sanggeon and Oh, Hyunwoo and Masukawa, Ryozo and Imani, Mohsen}, month = apr, year = {2026}, pages = {1--7}, }
- LogHD: Robust Compression of Hyperdimensional Classifiers via Logarithmic Class-Axis Reduction
  Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Pietro Mercati, Nathaniel D Bastian, and Mohsen Imani
  Design, Automation and Test in Europe Conference (DATE), Verona, Italy, Apr 2026, pp. 1–7
Hyperdimensional computing (HDC) suits memory, energy, and reliability-constrained systems, yet the standard "one prototype per class" design requires O(CD) memory (with C classes and dimensionality D). Prior compaction reduces D (feature axis), improving storage/compute but weakening robustness. We introduce LogHD, a logarithmic class-axis reduction that replaces the C per-class prototypes with n ≈ ⌈log_k C⌉ bundle hypervectors (alphabet size k) and decodes in an n-dimensional activation space, cutting memory to O(D log_k C) while preserving D. LogHD uses a capacity-aware codebook and profile-based decoding, and composes with feature-axis sparsification. Across datasets and injected bit flips, LogHD attains competitive accuracy with smaller models and higher resilience at matched memory. Under equal memory, it sustains target accuracy at roughly 2.5-3.0× higher bit-flip rates than feature-axis compression; an ASIC instantiation delivers 498× energy efficiency and 62.6× speedup over an AMD Ryzen 9 9950X and 24.3×/6.58× over an NVIDIA RTX 4090, and is 4.06× more energy-efficient and 2.19× faster than a feature-axis HDC ASIC baseline.
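The class-axis arithmetic is easy to check: with alphabet size k, n = ⌈log_k C⌉ codeword positions suffice to give every class a distinct code. A sketch of that counting argument only (LogHD's capacity-aware codebook and profile-based decoding are described in the paper):

```python
from math import ceil, log

def codeword(c, k, n):
    """Length-n base-k digit tuple identifying class c."""
    digits = []
    for _ in range(n):
        digits.append(c % k)
        c //= k
    return tuple(digits)

C, k = 100, 4                        # classes, alphabet size
n = ceil(log(C) / log(k))            # n = ceil(log_k C) = 4 positions
assert k ** n >= C                   # enough codewords for all classes
codes = {codeword(c, k, n) for c in range(C)}
assert len(codes) == C               # every class gets a distinct codeword
# class-axis memory: n bundle hypervectors instead of C prototypes
```

This is where the O(CD) to O(D log_k C) reduction comes from: the number of stored D-dimensional vectors scales with the codeword length, not the class count.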
@inproceedings{yun_loghd_2026, address = {Verona, Italy}, title = {{LogHD: Robust Compression of Hyperdimensional Classifiers via Logarithmic Class-Axis Reduction}}, booktitle = {{Design, Automation and Test in Europe Conference (DATE)}}, author = {Yun, Sanggeon and Oh, Hyunwoo and Masukawa, Ryozo and Mercati, Pietro and Bastian, Nathaniel D and Imani, Mohsen}, month = apr, year = {2026}, pages = {1--7}, }
- Scalable Symbolic Reasoning with Matrix-Based Brain-Inspired Representations and Vector-Space Acceleration
  William Youngwoo Chung, Hyunwoo Oh, Hamza Errahmouni Barkam, Calvin Yeung, and Mohsen Imani
  Design, Automation and Test in Europe Conference (DATE), Verona, Italy, Apr 2026, pp. 1–7
@inproceedings{chung_scalable_2026, address = {Verona, Italy}, title = {{Scalable Symbolic Reasoning with Matrix-Based Brain-Inspired Representations and Vector-Space Acceleration}}, booktitle = {{Design, Automation and Test in Europe Conference (DATE)}}, author = {Chung, William Youngwoo and Oh, Hyunwoo and Barkam, Hamza Errahmouni and Yeung, Calvin and Imani, Mohsen}, month = apr, year = {2026}, pages = {1--7}, }
- AAMLA: An Autonomous Agentic Framework for Memory-Aware LLM-Aided Hardware Generation
  Rajat Bhattacharjya, Juhee Sung, Hangyeol Jung, Hyunwoo Oh, Arnab Sarkar, Mohsen Imani, and Nikil Dutt
  International Conference On VLSI Design (VLSID), Pune, Maharashtra, India, Jan 2026, pp. 1–7
Large Language Models (LLMs) have recently emerged as powerful assistants for hardware design, translating natural-language specifications into Hardware Description Languages (HDLs), yet fine-tuning these models on domain-specific corpora routinely exceeds the memory capacity of commodity GPUs and triggers Out-of-Memory (OoM) errors. We present AAMLA, an autonomous agentic framework that converts this pain point into a push-button experience. AAMLA incorporates memory awareness by coupling (i) a predictive memory profiler, LLMem++, that significantly extends the original LLMem framework to support a diverse set of memory-efficient fine-tuning strategies, including adapter-based (LoRA, DoRA), gradient-free (MeZO), token-sparse (TokenTune), and optimizer-modified (APOLLO) methods, with (ii) a portfolio of complementary memory-efficient adaptation techniques, leading to a complete synthesis flow. The system allows for a two-step early design space exploration workflow: it first prunes any method whose predicted footprint would violate the user's GPU budget, then consults an offline accuracy-latency atlas to recommend the Pareto-optimal strategy that aligns with the designer's stated priority (accuracy or turnaround time). Guided by this workflow, an agentic controller configures and applies the chosen technique, guaranteeing OoM-free execution without manual trial-and-error. The framework is model- and dataset-agnostic, easily extensible with new tuning primitives, and exposes a simple interface that accepts natural-language prompts and emits synthesizable Verilog, thereby lowering the barrier to LLM-assisted hardware design for researchers and organizations with limited computational resources.
@inproceedings{bhattacharjya_aamla_2026, address = {Pune, Maharashtra, India}, title = {{AAMLA: An Autonomous Agentic Framework for Memory-Aware LLM-Aided Hardware Generation}}, booktitle = {{International Conference On VLSI Design (VLSID)}}, author = {Bhattacharjya, Rajat and Sung, Juhee and Jung, Hangyeol and Oh, Hyunwoo and Sarkar, Arnab and Imani, Mohsen and Dutt, Nikil}, month = jan, year = {2026}, pages = {1--7}, }
2025
- LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation
  Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, and Mohsen Imani
  ACM International Conference on Multimedia (MM), Dublin, Ireland, 2025, pp. 3932–3941
Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high-level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training-free visual token pruning method specifically designed for LVLM-based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine-grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.
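Token pruning itself is a simple operation; a generic score-based top-k keep step in NumPy (illustrative only: LVLM_CSP's staged clustering/scattering schedule, not shown here, decides which tokens survive each stage):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-k image tokens ranked by an importance score,
    preserving their original sequence order."""
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])      # top-k indices, original order
    return tokens[keep]

tokens = np.arange(10, dtype=float).reshape(10, 1)      # 10 tokens, dim 1
scores = np.array([.1, .9, .2, .8, .3, .7, .4, .6, .5, .0])
pruned = prune_tokens(tokens, scores, 0.3)              # keep 3 of 10
assert pruned.shape == (3, 1)
```

The FLOP savings the abstract reports follow directly from the shorter token sequence every subsequent attention layer sees.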
@inproceedings{10.1145/3746027.3755243, address = {Dublin, Ireland}, title = {{LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation}}, author = {Chen, Hanning and Ni, Yang and Huang, Wenjun and Oh, Hyunwoo and Liu, Yezi and Das, Tamoghno and Imani, Mohsen}, isbn = {9798400720352}, url = {https://doi.org/10.1145/3746027.3755243}, doi = {10.1145/3746027.3755243}, booktitle = {{ACM International Conference on Multimedia (MM)}}, year = {2025}, pages = {3932--3941}, }
- Revisiting Reconfigurable Acceleration of Vision Transformer with Patch Pruning
  Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Tamoghno Das, Fei Wen, and Mohsen Imani
  ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), University of Iceland, Iceland, Aug 2025, pp. 1–7
@inproceedings{chen_revisit_2025, address = {University of Iceland, Iceland}, title = {{Revisiting Reconfigurable Acceleration of Vision Transformer with Patch Pruning}}, booktitle = {{ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)}}, publisher = {IEEE}, author = {Chen, Hanning and Ni, Yang and Huang, Wenjun and Oh, Hyunwoo and Das, Tamoghno and Wen, Fei and Imani, Mohsen}, month = aug, year = {2025}, pages = {1--7}, }
- iTaskSense: Task-Oriented Object Detection in Resource-Constrained Environments
  SungHeon Jeong, Hamza Errahmouni Barkam, Hyunwoo Oh, Hanning Chen, Tamoghno Das, Zhen Ye, and Mohsen Imani
  ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, Jun 2025, pp. 1–7
Task-oriented object detection is increasingly essential for intelligent sensing applications, enabling AI systems to operate autonomously in complex, real-world environments such as autonomous driving, healthcare, and industrial automation. Conventional models often struggle with generalization, requiring vast datasets to accurately detect objects within diverse contexts. In this work, we introduce iTask, a task-oriented object detection framework that leverages large language models (LLMs) to generalize efficiently from limited samples by generating an abstract knowledge graph. This graph encapsulates essential task attributes, allowing iTask to identify objects based on high-level characteristics rather than extensive data, making it possible to adapt to complex mission requirements with minimal samples. iTask addresses the challenges of high computational cost and resource limitations in vision-language models by offering two configuration models: a distilled, task-specific vision transformer optimized for high accuracy in defined tasks, and a quantized version of the model for broader applicability across multiple tasks. Additionally, we designed a hardware acceleration circuit to support real-time processing, essential for edge devices that require low latency and efficient task execution. Our evaluations show that the task-specific configuration achieves a 15% higher accuracy over the quantized configuration in specific scenarios, while the quantized model provides robust multi-task performance. The hardware-accelerated iTask system achieves a 3.5x speedup and a 40% reduction in energy consumption compared to GPU-based implementations. These results demonstrate that iTask's dual-configuration approach and situational adaptability offer a scalable solution for task-specific object detection, providing robust and efficient performance in resource-constrained environments.
@inproceedings{jeong_itask_2025, address = {San Francisco, CA, USA}, title = {{iTaskSense: Task-Oriented Object Detection in Resource-Constrained Environments}}, doi = {10.1109/DAC63849.2025.11133060}, booktitle = {{ACM/IEEE Design Automation Conference (DAC)}}, author = {Jeong, SungHeon and Barkam, Hamza Errahmouni and Oh, Hyunwoo and Chen, Hanning and Das, Tamoghno and Ye, Zhen and Imani, Mohsen}, month = jun, year = {2025}, pages = {1--7}, }
- A Multimodal AI Acceleration with Dynamic Pruning and Run-time Configuration
  Hyun Woo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Behnam Khaleghi, Fei Wen, and Mohsen Imani
  IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, May 2025
The computational diversity of multimodal AI workloads—spanning vision transformers (ViTs), graph neural networks (GNNs), CNNs, and transformer-based NLP—poses a fundamental challenge to embedded acceleration platforms. We propose a fully integrated FPGA-based acceleration framework that addresses this heterogeneity via compile-time and runtime configurability. Our system introduces a reconfigurable processing unit (RPU) capable of executing dense and sparse matrix operations (DDMM, SpMM, SDDMM), a scalable top-k pruning engine for ViTs, and a domain-specific compiler for hardware-software co-design. The architecture supports real-time configuration without reloading bitstreams, enabling unified deployment across tasks. Implementations on Xilinx U50 and ZCU104 demonstrate up to 22.57× and 6.86× latency reductions versus RTX 4090 and Jetson Orin Nano, respectively, validating the design’s efficiency for real-time, resource-limited environments.
@inproceedings{oh_multimodal_2025, address = {Fayetteville, AR, USA}, title = {{A Multimodal AI Acceleration with Dynamic Pruning and Run-time Configuration}}, doi = {10.1109/FCCM62733.2025.00072}, booktitle = {{IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM)}}, author = {Oh, Hyun Woo and Chen, Hanning and Yun, Sanggeon and Ni, Yang and Khaleghi, Behnam and Wen, Fei and Imani, Mohsen}, month = may, year = {2025}, }
- EOS: Edge-Based Operation Skip Scheme for Real-Time Object Detection Using Viola-Jones Classifier
  Cheol-Ho Choi, Joonhwan Han, Hyun Woo Oh, Jeongwoo Cha, and Jungho Shin
  Electronics, May 2025
Machine learning-based object detection systems are preferred due to their cost-effectiveness compared to deep learning approaches. Among machine learning methods, the Viola-Jones classifier stands out for its reasonable accuracy and efficient resource utilization. However, as the number of classification iterations increases or the resolution of the input image increases, the detection processing speed may decrease. To address the detection speed issue related to input image resolution, an improved edge component calibration method is applied. Additionally, an edge-based operation skip scheme is proposed to overcome the detection processing speed problem caused by the number of classification iterations. Our experiments using the FDDB public dataset show that our method reduces classification iterations by 24.6157% to 84.1288% compared to conventional methods, except for our previous study. Importantly, our method maintains detection accuracy while reducing classification iterations. This result implies that our method can realize almost real-time object detection when implemented on field-programmable gate arrays.
@article{choi_eos_2025, title = {{EOS: Edge-Based Operation Skip Scheme for Real-Time Object Detection Using Viola-Jones Classifier}}, issn = {2079-9292}, url = {https://www.mdpi.com/2079-9292/14/2/397}, doi = {10.3390/electronics14020397}, author = {Choi, Cheol-Ho and Han, Joonhwan and Oh, Hyun Woo and Cha, Jeongwoo and Shin, Jungho}, journal = {Electronics}, volume = {14}, number = {2}, pages = {397}, year = {2025}, }
2024
- Algorithm for LWIR Thermal Imaging Camera with Minimal Mechanical Shutter Utilization
  Taehyun Kim, Joonhwan Han, Jeongwoo Cha, Hyunmin Choi, Jungho Shin, Eunchong Kim, Hyun Woo Oh, Cheol-Ho Choi, Seongtaek Hong, and Taehyung Kim
  IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Danang, Vietnam, Nov 2024, pp. 1–4
Uncooled LWIR (Long-Wave InfraRed) thermal imaging cameras are characterized by non-uniformity, because infrared detectors exhibit nonlinear characteristics depending on the environmental temperature. In this paper, we propose a method to smoothly transition between correcting non-uniformity with a single shutter actuation while the thermal imaging camera is not yet stable at start-up, and correcting non-uniformity via conventional NUC (Non-Uniformity Correction) once the camera has stabilized. The proposed method was confirmed to have performance similar to the conventional method, in which the thermal imaging camera uses the shutter several times during initial start-up. The conventional method closes the shutter multiple times to correct non-uniformity, which obscures information necessary for driving. In contrast, the proposed method closes the shutter only once during initial start-up, and therefore does not obscure that information. It is thus suitable for auxiliary systems used on autonomous driving platforms.
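The single-shutter step the abstract describes builds on one-point (offset) NUC: one frame of a uniform shutter scene estimates each pixel's fixed-pattern offset, which is then subtracted from live frames. A minimal sketch of generic one-point NUC (not the paper's transition logic; the data here is synthetic):

```python
import numpy as np

def one_point_nuc(raw, shutter):
    """Subtract the per-pixel fixed-pattern offset estimated from one
    uniform-scene shutter frame; the mean level is preserved."""
    return raw - (shutter - shutter.mean())

rng = np.random.default_rng(1)
fpn = rng.standard_normal((4, 4))          # per-pixel fixed-pattern noise
scene = rng.uniform(20.0, 30.0, (4, 4))    # true scene radiometry
raw = scene + fpn                          # detector output
shutter = 25.0 + fpn                       # uniform shutter scene + same FPN
corrected = one_point_nuc(raw, shutter)
assert np.allclose(corrected, scene + fpn.mean())   # FPN removed up to a constant
```

Because the offset estimate drifts as the camera warms, conventional systems re-close the shutter repeatedly; limiting that to one actuation is what keeps the scene visible during start-up.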
@inproceedings{kim_algorithm_2024, address = {Danang, Vietnam}, title = {{Algorithm for LWIR Thermal Imaging Camera with Minimal Mechanical Shutter Utilization}}, isbn = {979-8-3315-3083-9}, doi = {10.1109/ICCE-Asia63397.2024.10773806}, author = {Kim, Taehyun and Han, Joonhwan and Cha, Jeongwoo and Choi, Hyunmin and Shin, Jungho and Kim, Eunchong and Oh, Hyun Woo and Choi, Cheol-Ho and Hong, Seongtaek and Kim, Taehyung}, booktitle = {{IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia)}}, month = nov, year = {2024}, pages = {1--4}, }
- Contrast Enhancement Method using Region-based Dynamic Clipping Technique for LWIR-based Thermal Camera of Night Vision Systems
  Cheol-Ho Choi, Joonhwan Han, Jeongwoo Cha, Hyunmin Choi, Jungho Shin, Taehyun Kim, and Hyun Woo Oh
  Sensors, Jun 2024, pp. 3829
In the autonomous driving industry, there is a growing trend to employ long-wave infrared (LWIR)-based uncooled thermal-imaging cameras, capable of robustly collecting data even in extreme environments. Consequently, both industry and academia are actively researching contrast-enhancement techniques to improve the quality of LWIR-based thermal-imaging cameras. However, most research results only showcase experimental outcomes using mass-produced products that already incorporate contrast-enhancement techniques. Put differently, there is a lack of experimental data on contrast enhancement after the non-uniformity correction (NUC) and temperature compensation (TC) processes, which generate the images seen in the final products. To bridge this gap, we propose a histogram equalization (HE)-based contrast enhancement method that incorporates a region-based clipping technique. Furthermore, we present experimental results on the images obtained after applying NUC and TC processes. We simultaneously conducted visual and qualitative performance evaluations on images acquired after NUC and TC processes. In the visual evaluation, it was confirmed that the proposed method improves image clarity and contrast ratio compared to conventional HE-based methods, even in challenging driving scenarios such as tunnels. In the qualitative evaluation, the proposed method demonstrated upper-middle-class rankings in both image quality and processing speed metrics. Therefore, our proposed method proves to be effective for the essential contrast enhancement process in LWIR-based uncooled thermal-imaging cameras intended for autonomous driving platforms.
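Clipping in HE-based contrast enhancement caps each histogram bin before the CDF is built, limiting noise amplification in flat regions. A global (non-region-based) sketch for 8-bit data, assuming the paper's region-based variant applies the same idea per region with a dynamic limit:

```python
import numpy as np

def clipped_he(img, clip_limit, levels=256):
    """Histogram equalization with a clip limit: bin counts above the
    limit are truncated and the excess is redistributed uniformly."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    excess = np.clip(hist - clip_limit, 0, None).sum()
    hist = np.minimum(hist, clip_limit) + excess // levels
    cdf = hist.cumsum()
    lut = np.round((cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1) * (levels - 1))
    return lut.astype(np.uint8)[img]

img = np.arange(256, dtype=np.uint8).reshape(16, 16)   # uniform histogram
assert np.array_equal(clipped_he(img, 1000), img)      # HE is identity here
```

With a small clip limit the mapping flattens toward identity, which is exactly the mechanism that tames the contrast over-stretching plain HE produces in low-texture thermal scenes.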
@article{choi_contrast_2024, title = {{Contrast Enhancement Method using Region-based Dynamic Clipping Technique for LWIR-based Thermal Camera of Night Vision Systems}}, volume = {24}, issn = {1424-8220}, url = {https://www.mdpi.com/1424-8220/24/12/3829}, doi = {10.3390/s24123829}, number = {12}, journal = {Sensors}, author = {Choi, Cheol-Ho and Han, Joonhwan and Cha, Jeongwoo and Choi, Hyunmin and Shin, Jungho and Kim, Taehyun and Oh, Hyun Woo}, month = jun, year = {2024}, pages = {3829}, }
- A Compact Real-Time Thermal Imaging System Based on Heterogeneous System-on-Chip
  Hyun Woo Oh, Cheol-Ho Choi, Jeong Woo Cha, Hyunmin Choi, Jung-Ho Shin, and Joon Hwan Han
  IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Aug 2024, pp. 97–107
This paper presents a real-time embedded thermal imaging system architecture for compact, energy-efficient, high-quality imaging utilizing a heterogeneous system-on-chip and uncooled infrared focal plane arrays (IRFPAs). In contrast to previous systems that organized separate devices for complex image processing, our system provides integrated image processing support for robust sensor-to-surveillance operation. We organized the image processing architecture into two algorithm stacks: a non-uniformity correction stack to mitigate the distinctive noise vulnerability of uncooled IRFPAs and an image enhancement stack, which includes contrast enhancement and frame-level temporal noise filters. We optimized the algorithms for domain-specific factors, including asymmetric multiprocessing (AMP), cache organization, single instruction multiple data (SIMD) instructions, and very long instruction word (VLIW) architectures. The implementation on a TI TDA3x SoC demonstrates that our system can process 640×480, 60 frames per second (FPS) video at a peak core load of 57.5% while consuming less than 2.2 W for the entire system, indicating the feasibility of processing the 1280×1024, 30 FPS video from state-of-the-art IRFPAs.
@inproceedings{oh_compact_2024, address = {Sokcho, Korea}, title = {{A Compact Real-Time Thermal Imaging System Based on Heterogeneous System-on-Chip}}, isbn = {979-8-3503-8795-7}, doi = {10.1109/RTCSA62462.2024.00023}, booktitle = {{IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)}}, publisher = {IEEE}, author = {Oh, Hyun Woo and Choi, Cheol-Ho and Cha, Jeong Woo and Choi, Hyunmin and Shin, Jung-Ho and Han, Joon Hwan}, month = aug, year = {2024}, pages = {97--107}, }
- Fast Object Detection Algorithm using Edge-based Operation Skip Scheme with Viola-Jones Method
  Cheol-Ho Choi, Joonhwan Han, Jeongwoo Cha, Jungho Shin, and Hyun Woo Oh
  IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Abu Dhabi, UAE, Apr 2024, pp. 199–203
@inproceedings{choi_fast_2024, address = {Abu Dhabi, UAE}, title = {{Fast Object Detection Algorithm using Edge-based Operation Skip Scheme with Viola-Jones Method}}, booktitle = {{IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)}}, publisher = {IEEE}, author = {Choi, Cheol-Ho and Han, Joonhwan and Cha, Jeongwoo and Shin, Jungho and Oh, Hyun Woo}, month = apr, year = {2024}, pages = {199--203}, }
- DL-Sort: A Hybrid Approach to Scalable Hardware-Accelerated Fully-Streaming Sorting
  Hyun Woo Oh, Joungmin Park, and Seung Eun Lee
  IEEE Transactions on Circuits and Systems II: Express Briefs, May 2024, pp. 2549–2553
Designing a high-performance hardware sorter for resource-constrained systems is challenging due to physical limitations and the need to balance streaming bandwidth with memory throughput. This paper introduces a novel, scalable hardware sorter architecture with fully-streaming support and an accompanying RTL generator to provide versatile, energy-efficient hardware acceleration. Our solution employs a dual-layer architecture consisting of a parallel one-way linear insertion sorter (OLIS) for bandwidth optimization and a cyclic bitonic merge network (CBMN) for a compact, high-throughput implementation. Furthermore, we developed the RTL generator, written in Chisel, to provide agile implementation of the scalable architecture. Experimental results targeting the Xilinx XVU37P-FSVH2892-2L-E FPGA show that our design achieves up to a 126.26% increase in throughput and a 68.46% decrease in latency, with an area increment of no more than 132.94% for LUTs and a decrement of up to 79.84% for flip-flops, compared to a state-of-the-art streaming sorter.
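The cyclic bitonic merge network (CBMN) builds on the standard bitonic merger: a log2(n)-stage compare-exchange network that sorts any bitonic input. A software model of that textbook building block (the merger only; DL-Sort's cyclic scheduling and the OLIS front end are not modeled here):

```python
def bitonic_merge(a):
    """Sort a bitonic sequence (power-of-two length) using the
    log2(n)-stage compare-exchange network a hardware merger implements."""
    a = list(a)
    n, k = len(a), len(a) // 2
    while k >= 1:                          # one network stage per halving of k
        for i in range(n):
            j = i ^ k                      # compare-exchange partner index
            if j > i and a[i] > a[j]:
                a[i], a[j] = a[j], a[i]
        k //= 2
    return a

# an ascending-then-descending sequence is bitonic
assert bitonic_merge([1, 4, 6, 7, 8, 5, 3, 2]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

In hardware, each `while` iteration is a pipeline stage of n/2 comparators, which is why the merger's latency grows only logarithmically with the block size.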
@article{oh_dl-sort_2024, title = {{DL-Sort: A Hybrid Approach to Scalable Hardware-Accelerated Fully-Streaming Sorting}}, volume = {71}, number = {5}, issn = {1549-7747}, url = {https://ieeexplore.ieee.org/document/10472626}, doi = {10.1109/TCSII.2024.3377255}, journal = {IEEE Transactions on Circuits and Systems II: Express Briefs}, author = {Oh, Hyun Woo and Park, Joungmin and Lee, Seung Eun}, month = may, year = {2024}, pages = {2549--2553}, }
2023
- Cell-Based Refinement Processor Utilizing Disparity Characteristics of Road Environment for SGM-Based Stereo Vision Systems
  Cheol-Ho Choi, Hyun Woo Oh, Joonhwan Han, and Jungho Shin
  IEEE Access, Dec 2023, pp. 138122–138140
Embedded stereo vision systems based on traditional approaches often require a disparity refinement process to enhance image quality. Weighted median filter (WMF)-based processors are commonly employed for their excellent refinement performance. However, when implemented on a field-programmable gate array (FPGA), WMF-based processors face a trade-off between hardware resource utilization and refinement performance. To address this trade-off, we previously proposed a disparity refinement processor based on the hybrid max-median filter (HMMF). However, our earlier work did not guarantee flawless operation in large occluded and texture-less regions, particularly in areas with numerous holes. To overcome this limitation, we propose a cell-based disparity refinement processor that extends our previous HMMF-based design. To evaluate its refinement performance, we conducted experiments using four types of publicly available stereo datasets. Our proposed processor outperforms conventional processors on the KITTI 2012 and 2015 stereo benchmark datasets, and the results likewise demonstrate superior refinement performance on the Cityscapes and StereoDriving datasets. Furthermore, when considering hardware resource utilization, our processor requires fewer resources than conventional processors when implemented on an FPGA. Therefore, our proposed disparity refinement processor is well-suited for the disparity refinement process in stereo vision systems that require cost-effectiveness and high performance.
@article{choi_cell-based_2023, title = {Cell-{Based} {Refinement} {Processor} {Utilizing} {Disparity} {Characteristics} of {Road} {Environment} for {SGM}-{Based} {Stereo} {Vision} {Systems}}, volume = {11}, issn = {2169-3536}, url = {https://ieeexplore.ieee.org/document/10339275}, doi = {10.1109/ACCESS.2023.3338649}, journal = {IEEE Access}, author = {Choi, Cheol-Ho and Oh, Hyun Woo and Han, Joonhwan and Shin, Jungho}, month = dec, year = {2023}, pages = {138122--138140}, }
- An SoC FPGA-based Integrated Real-time Image Processor for Uncooled Infrared Focal Plane Array. Hyun Woo Oh, Cheol-Ho Choi, Jeong Woo Cha, Hyunmin Choi, Joon Hwan Han, and Jung-Ho Shin. Euromicro Conference on Digital System Design (DSD), Durres, Albania, Sep 2023, pp. 660–668
This paper presents an integrated image processor architecture designed for real-time interfacing and processing of high-resolution thermal video obtained from an uncooled infrared focal plane array (IRFPA) utilizing a modern system-on-chip field-programmable gate array (SoC FPGA). Our processor provides a one-chip solution for incorporating non-uniformity correction (NUC) algorithms and contrast enhancement methods (CEM) to be performed seamlessly. We have employed NUC algorithms that utilize multiple coefficients to ensure robust image quality, free from ghosting effects and blurring. These algorithms include polynomial modeling-based thermal drift compensation (TDC), two-point correction (TPC), and run-time discrete flat field correction (FFC). To address the memory bottlenecks originating from the parallel execution of NUC algorithms in real-time, we designed accelerators and parallel caching modules for pixel-wise algorithms based on a multi-parameter polynomial expression. Furthermore, we designed a specialized accelerator architecture to minimize the interrupted time for run-time FFC. The implementation on the XC7Z020CLG400 SoC FPGA with the QuantumRed VR thermal module demonstrates that our image processing module achieves a throughput of 60 frames per second (FPS) when processing 14-bit 640×480 resolution infrared video acquired from an uncooled IRFPA.
@inproceedings{oh_soc_2023, address = {Durres, Albania}, title = {An {SoC} {FPGA}-based {Integrated} {Real}-time {Image} {Processor} for {Uncooled} {Infrared} {Focal} {Plane} {Array}}, isbn = {979-8-3503-4419-6}, url = {https://ieeexplore.ieee.org/document/10456855}, doi = {10.1109/DSD60849.2023.00095}, booktitle = {{Euromicro} {Conference} on {Digital} {System} {Design} ({DSD})}, publisher = {IEEE}, author = {Oh, Hyun Woo and Choi, Cheol-Ho and Cha, Jeong Woo and Choi, Hyunmin and Han, Joon Hwan and Shin, Jung-Ho}, month = sep, year = {2023}, pages = {660--668}, }
- Disparity Refinement Processor Architecture utilizing Horizontal and Vertical Characteristics for Stereo Vision Systems. Cheol-Ho Choi and Hyun Woo Oh. Euromicro Conference on Digital System Design (DSD), Durres, Albania, Sep 2023, pp. 220–226
In embedded stereo vision systems based on semi-global matching, the matching accuracy of the initial disparity map can be degraded by various factors. To solve this problem, weighted median-based disparity refinement hardware architectures are utilized to improve the matching accuracy. However, for the conventional hardware architectures, there is a trade-off between hardware resource utilization and refinement performance when they are implemented on a field-programmable gate array (FPGA). Therefore, in this paper, we propose a hybrid max-median filter and its hardware architecture to improve the refinement performance and reduce hardware resource utilization. To evaluate the refinement performance, we used two public stereo datasets. For various window sizes on the KITTI 2012 and 2015 stereo benchmark datasets, the proposed hardware architecture showed better matching accuracy than the conventional hardware architectures. In terms of hardware resource utilization, when implemented on an FPGA, the proposed hardware architecture has low requirements for all types of hardware resources. That is, the proposed hardware architecture overcomes the trade-off between hardware resource utilization and refinement performance.
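The paper's hybrid max-median filter (HMMF) is defined in the paper itself; purely as a point of reference, the classical max-median filter takes the median along each of the four 1-D lines through the window center (row, column, two diagonals) and then the maximum of those medians. A minimal software sketch of that textbook formulation (illustrative only, not the paper's hardware design):

```python
from statistics import median

def max_median(window):
    """Classical max-median filter on one n x n window (n odd):
    median along each of the four 1-D lines through the center
    (row, column, two diagonals), then the maximum of the medians.
    """
    n = len(window)
    c = n // 2
    row = window[c]
    col = [window[i][c] for i in range(n)]
    diag = [window[i][i] for i in range(n)]
    anti = [window[i][n - 1 - i] for i in range(n)]
    return max(median(row), median(col), median(diag), median(anti))
```

Taking the maximum over directional medians preserves thin linear structures (such as lane-like disparity edges) that a plain 2-D median would erode, which is why max-median variants suit road scenes.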
@inproceedings{choi_disparity_2023, address = {Durres, Albania}, title = {Disparity {Refinement} {Processor} {Architecture} utilizing {Horizontal} and {Vertical} {Characteristics} for {Stereo} {Vision} {Systems}}, isbn = {979-8-3503-4419-6}, url = {https://ieeexplore.ieee.org/document/10456793}, doi = {10.1109/DSD60849.2023.00040}, booktitle = {{Euromicro} {Conference} on {Digital} {System} {Design} ({DSD})}, publisher = {IEEE}, author = {Choi, Cheol-Ho and Oh, Hyun Woo}, month = sep, year = {2023}, pages = {220--226}, }
- RF2P: A Lightweight RISC Processor Optimized for Rapid Migration from IEEE-754 to Posit. Hyun Woo Oh, Seongmo An, Won Sik Jeong, and Seung Eun Lee. ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), Vienna, Austria, Aug 2023, pp. 1–6
This paper presents a lightweight processor and evaluation platform for migrating from IEEE-754 to posit arithmetic, with an optimized posit arithmetic unit (PAU) supporting existing floating-point instructions. The PAU features a reconfigurable divider architecture for diverse operating conditions and lightweight square root logic. The platform includes a posit-optimized compiler, divider generator, JTAG environment builder, and programmable logic controller. The experimental results demonstrate the successful execution of legacy IEEE-754 code with a small additional workload and up to a 60.09-times performance improvement through hardware acceleration. Additionally, the PAU and divider consume 11.00% and 57.87% fewer LUTs, respectively, compared to the best prior works.
@inproceedings{oh_rf2p_2023, address = {Vienna, Austria}, title = {{RF2P}: {A} {Lightweight} {RISC} {Processor} {Optimized} for {Rapid} {Migration} from {IEEE}-754 to {Posit}}, isbn = {979-8-3503-1175-4}, url = {https://ieeexplore.ieee.org/document/10244582/}, doi = {10.1109/ISLPED58423.2023.10244582}, booktitle = {{ACM}/{IEEE} {International} {Symposium} on {Low} {Power} {Electronics} and {Design} ({ISLPED})}, publisher = {IEEE}, author = {Oh, Hyun Woo and An, Seongmo and Jeong, Won Sik and Lee, Seung Eun}, month = aug, year = {2023}, pages = {1--6}, }
- The Design of Optimized RISC Processor for Edge Artificial Intelligence Based on Custom Instruction Set Extension. Hyun Woo Oh and Seung Eun Lee. IEEE Access, May 2023, pp. 49409–49421
Edge computing is becoming increasingly popular in artificial intelligence (AI) application development due to the benefits of local execution. One widely used approach to overcome hardware limitations in edge computing is heterogeneous computing, which combines a general-purpose processor with a domain-specific AI processor. However, this approach can be inefficient due to the communication overhead resulting from the complex communication protocol. To avoid communication overhead, the concept of an application-specific instruction set processor based on a customizable instruction set architecture (ISA) has emerged. By integrating the AI processor into the processor core, on-chip communication replaces the complex communication protocol. Further, a custom instruction set extension (ISE) reduces the number of instructions needed to execute AI applications. In this paper, we propose a uniprocessor system architecture for lightweight AI systems. First, we define the custom ISE to integrate the AI processor and GPP into a single processor, minimizing communication overhead. Next, we designed the processor based on the integrated core architecture, including the base core and the AI core, and implemented the processor on an FPGA. Finally, we evaluated the proposed architecture through simulation and implementation of the processor. The results show that the designed processor consumed 6.62% more lookup tables and 74% fewer flip-flops while achieving up to 193.88 times the throughput and 52.75 times the energy efficiency of the previous system.
@article{oh_design_2023, title = {The {Design} of {Optimized} {RISC} {Processor} for {Edge} {Artificial} {Intelligence} {Based} on {Custom} {Instruction} {Set} {Extension}}, volume = {11}, issn = {2169-3536}, url = {https://ieeexplore.ieee.org/document/10124773/}, doi = {10.1109/ACCESS.2023.3276411}, journal = {IEEE Access}, author = {Oh, Hyun Woo and Lee, Seung Eun}, month = may, year = {2023}, pages = {49409--49421}, }
2022
- Evaluation of Posit Arithmetic on Machine Learning based on Approximate Exponential Functions. Hyun Woo Oh, Won Sik Jeong, and Seung Eun Lee. International SoC Design Conference (ISOCC), Gangneung, Korea, Oct 2022, pp. 358–359
Recent advances in semiconductor technology lead ongoing applications to adopt complex techniques based on neural networks. In line with this trend, the concept of optimizing real number arithmetic has been raised. In this paper, we evaluate the performance of the novel number system named posit on neural networks by analyzing the execution of approximate exponential functions, which are fundamental to several activation functions, with posit32 and float32. To implement the functions with posit arithmetic, we designed a software posit library consisting of basic arithmetic operations and conversion operations from/to C standard data types. The result shows that posit arithmetic reduces the average relative error rate by up to 87.12% on the exponential function.
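The posit-vs-float comparison itself requires a posit library, but the average-relative-error metric the abstract reports is easy to illustrate with ordinary floats. Below, a truncated Taylor series stands in for an approximate exponential function (an assumed example, not the paper's implementation), and the relative error is averaged over a sample grid:

```python
import math

def exp_taylor(x, terms=10):
    """Truncated Taylor-series approximation of exp(x)."""
    acc, term = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n       # x^n / n! built incrementally
        acc += term
    return acc

def avg_relative_error(f, ref, xs):
    """Mean relative error of approximation f against reference ref."""
    return sum(abs(f(x) - ref(x)) / abs(ref(x)) for x in xs) / len(xs)

# Sample the interval [-2, 2]; the same metric would compare a
# posit32-backed exp against a float32 one in the paper's setting.
xs = [i / 10 for i in range(-20, 21)]
err = avg_relative_error(exp_taylor, math.exp, xs)
```

Swapping the arithmetic backend (posit32 vs. float32) while keeping `avg_relative_error` fixed is what makes the 87.12% error-reduction figure comparable across number systems.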
@inproceedings{oh_evaluation_2022, address = {Gangneung, Korea}, title = {Evaluation of {Posit} {Arithmetic} on {Machine} {Learning} based on {Approximate} {Exponential} {Functions}}, isbn = {978-1-66545-971-6}, url = {https://ieeexplore.ieee.org/document/10031524/}, doi = {10.1109/ISOCC56007.2022.10031524}, booktitle = {{International} {SoC} {Design} {Conference} ({ISOCC})}, publisher = {IEEE}, author = {Oh, Hyun Woo and Jeong, Won Sik and Lee, Seung Eun}, month = oct, year = {2022}, pages = {358--359}, }
- An Edge AI Device based Intelligent Transportation System. Youngwoo Jeong, Hyun Woo Oh, Soohee Kim, and Seung Eun Lee. Journal of Information and Communication Convergence Engineering, Sep 2022, pp. 166–173
Recently, studies have been conducted on intelligent transportation systems (ITS) that provide safety and convenience to humans. Systems that compose the ITS adopt architectures based on cloud computing, which consist of a high-performance general-purpose processor or graphics processing unit. However, an architecture that only uses cloud computing requires a high network bandwidth and consumes much power. Therefore, applying edge computing to ITS is essential for solving these problems. In this paper, we propose an edge artificial intelligence (AI) device based ITS. Edge AI, which is applicable to various systems in ITS, has been applied to license plate recognition. We implemented the edge AI on a field-programmable gate array (FPGA). The accuracy of the edge AI for license plate recognition was 0.94. Finally, we synthesized the edge AI logic with Magnachip/Hynix 180 nm CMOS technology, and the power consumption measured using the Synopsys Design Compiler tool was 482.583 mW.
@article{jeong_edge_2022, title = {An {Edge} {AI} {Device} based {Intelligent} {Transportation} {System}}, volume = {20}, issn = {2234-8883}, doi = {10.56977/jicce.2022.20.3.166}, number = {3}, journal = {Journal of Information and Communication Convergence Engineering}, author = {Jeong, Youngwoo and Oh, Hyun Woo and Kim, Soohee and Lee, Seung Eun}, month = sep, year = {2022}, pages = {166--173}, }
- Intelligent Transportation System based on an Edge AI. Young Woo Jeong, Hyun Woo Oh, Su Yeon Jang, and Seung Eun Lee. International Conference on Future Information & Communication Engineering, Jeju, Korea, Jan 2022, pp. 202–206
An intelligent transportation system (ITS) is a future system that combines various technologies to provide safety and convenience to humans. In order to implement ITS, previous systems applied an architecture that contains a large number of data centers with high-performance general-purpose processors and graphics processing units to collect the information of vehicles. However, this architecture not only requires a high network bandwidth but also reduces power efficiency and weakens security. In this paper, we propose an ITS based on an edge AI device which solves the problems of the existing structure. We applied the edge AI device, which is applicable to various systems in ITS, to license plate recognition, and the highest accuracy was 0.94. We implemented the edge AI device on a field-programmable gate array (FPGA) and verified the feasibility of the entire system with the proposed edge AI device.
@inproceedings{jeong_intelligent_2022, address = {Jeju, Korea}, title = {Intelligent {Transportation} {System} based on an {Edge} {AI}}, volume = {13}, url = {https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11036311}, booktitle = {{International} {Conference} on {Future} {Information} \& {Communication} {Engineering}}, publisher = {The Korea Institute of Information and Communication Engineering}, author = {Jeong, Young Woo and Oh, Hyun Woo and Jang, Su Yeon and Lee, Seung Eun}, month = jan, year = {2022}, pages = {202--206}, }
- A Local Interconnect Network Controller for Resource-Constrained Automotive Devices. Kwonneung Cho, Hyun Woo Oh, Jeongeun Kim, Young Woo Jeong, and Seung Eun Lee. IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, Jan 2022, pp. 1–3
As the amount of data in automotive systems increases, a dedicated communication controller for in-vehicle networks is required. This paper proposes a local interconnect network (LIN) controller for resource-constrained devices. The designed LIN controller efficiently reduces the workload of target devices by processing the LIN frame header, data response, and protocol errors. To demonstrate the feasibility of the design, a Cortex-M0 is employed as the main processor and connected to the LIN controller. We implemented a LIN node by programming the processor, and the functionality of the LIN controller was verified with a LIN frame analyzer and a hardware scope. In addition, we analyzed the effect of communication loads on the processor and evaluated the benefits of the LIN controller.
@inproceedings{cho_local_2022, address = {Las Vegas, NV, USA}, title = {A {Local} {Interconnect} {Network} {Controller} for {Resource}-{Constrained} {Automotive} {Devices}}, isbn = {978-1-66544-154-4}, url = {https://ieeexplore.ieee.org/document/9730493/}, doi = {10.1109/ICCE53296.2022.9730493}, booktitle = {{IEEE} {International} {Conference} on {Consumer} {Electronics} ({ICCE})}, publisher = {IEEE}, author = {Cho, Kwonneung and Oh, Hyun Woo and Kim, Jeongeun and Jeong, Young Woo and Lee, Seung Eun}, month = jan, year = {2022}, pages = {1--3}, }
2021
- A Multi-Core Controller for an Embedded AI System Supporting Parallel Recognition. Suyeon Jang, Hyun Woo Oh, Young Hyun Yoon, Dong Hyun Hwang, Won Sik Jeong, and Seung Eun Lee. Micromachines, Jul 2021, pp. 852
Recent advances in artificial intelligence (AI) technology encourage the adoption of AI systems for various applications. In most deployments, AI-based computing systems adopt the architecture in which the central server processes most of the data. This characteristic makes the system use a high amount of network bandwidth and can cause security issues. In order to overcome these issues, a new AI model called federated learning was presented. Federated learning adopts an architecture in which the clients take care of data training and transmit only the trained result to the central server. As the data training from the client abstracts and reduces the original data, the system operates with reduced network resources and reinforced data security. A system with federated learning supports a variety of client systems. To build an AI system with resource-limited client systems, composing the client system with multiple embedded AI processors is a valid approach. For realizing a system with this architecture, introducing a controller to arbitrate and utilize the AI processors becomes a stringent requirement. In this paper, we propose an embedded AI system for federated learning that can be composed flexibly with the AI core depending on the application. In order to realize the proposed system, we designed a controller for multiple AI cores and implemented it on a field-programmable gate array (FPGA). The operation of the designed controller was verified through image and speech applications, and the performance was verified through a simulator.
@article{jang_multi-core_2021, title = {A {Multi}-{Core} {Controller} for an {Embedded} {AI} {System} {Supporting} {Parallel} {Recognition}}, volume = {12}, issn = {2072-666X}, url = {https://www.mdpi.com/2072-666X/12/8/852}, doi = {10.3390/mi12080852}, number = {8}, journal = {Micromachines}, author = {Jang, Suyeon and Oh, Hyun Woo and Yoon, Young Hyun and Hwang, Dong Hyun and Jeong, Won Sik and Lee, Seung Eun}, month = jul, year = {2021}, pages = {852}, }
- ASimOV: A Framework for Simulation and Optimization of an Embedded AI Accelerator. Dong Hyun Hwang, Chang Yeop Han, Hyun Woo Oh, and Seung Eun Lee. Micromachines, Jul 2021, pp. 838
Artificial intelligence algorithms need an external computing device such as a graphics processing unit (GPU) due to their computational complexity. For running artificial intelligence algorithms on an embedded device, many studies have proposed lightweight artificial intelligence algorithms and artificial intelligence accelerators. In this paper, we propose the ASimOV framework, which optimizes artificial intelligence algorithms and generates Verilog hardware description language (HDL) code for executing artificial intelligence algorithms on a field-programmable gate array (FPGA). To verify ASimOV, we explore the performance space of k-NN algorithms and generate Verilog HDL code to demonstrate the k-NN accelerator on an FPGA. Our contribution is to provide the artificial intelligence algorithm as an end-to-end pipeline and ensure that it is optimized to a specific dataset through simulation, with an artificial intelligence accelerator generated in the end.
@article{hwang_asimov_2021, title = {{ASimOV}: {A} {Framework} for {Simulation} and {Optimization} of an {Embedded} {AI} {Accelerator}}, volume = {12}, issn = {2072-666X}, shorttitle = {{ASimOV}}, url = {https://www.mdpi.com/2072-666X/12/7/838}, doi = {10.3390/mi12070838}, number = {7}, journal = {Micromachines}, author = {Hwang, Dong Hyun and Han, Chang Yeop and Oh, Hyun Woo and Lee, Seung Eun}, month = jul, year = {2021}, pages = {838}, }
- The Design of a 2D Graphics Accelerator for Embedded Systems. Hyun Woo Oh, Ji Kwang Kim, Gwan Beom Hwang, and Seung Eun Lee. Electronics, Feb 2021, pp. 469
Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing running on limited design specifications such as low power consumption and a small area. In order to satisfy such conditions, including a specific 2D graphics accelerator in the embedded system is an effective method. This method reduces the workload of the processor in the embedded system by exploiting the accelerator. The accelerator assists the system to perform 2D graphics processing in real-time. Therefore, a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham’s algorithm. The optimized operation enables the accelerator to deal with various kinds of 2D graphics processing and to perform the line-drawing instead of the system processor. Moreover, the accelerator also distributes the workload of the processor core by removing the need for the core to access the frame buffer memory. We measure the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and ascertaining the possibility of realization by synthesizing using the 180 nm CMOS process.
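Bresenham's algorithm, on which the accelerator's line-drawing operation is based, uses only integer additions and comparisons, which is one reason it maps well to small hardware. A minimal software rendition for reference (the accelerator's optimized variant is described in the paper):

```python
def bresenham(x0, y0, x1, y1):
    """Rasterize a line from (x0, y0) to (x1, y1) with integer-only
    arithmetic (Bresenham's algorithm, all-octant form)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy               # running decision variable
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:            # step in x
            err += dy
            x0 += sx
        if e2 <= dx:            # step in y
            err += dx
            y0 += sy
    return points
```

Because each pixel costs one compare-and-add per axis, a hardware version can emit one pixel per clock without a multiplier or divider, which is what lets the accelerator write the frame buffer without involving the processor core.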
@article{oh_design_2021, title = {The {Design} of a {2D} {Graphics} {Accelerator} for {Embedded} {Systems}}, volume = {10}, issn = {2079-9292}, url = {https://www.mdpi.com/2079-9292/10/4/469}, doi = {10.3390/electronics10040469}, number = {4}, journal = {Electronics}, author = {Oh, Hyun Woo and Kim, Ji Kwang and Hwang, Gwan Beom and Lee, Seung Eun}, month = feb, year = {2021}, pages = {469}, }
- Lossless Decompression Accelerator for Embedded Processor with GUI. Gwan Beom Hwang, Kwon Neung Cho, Chang Yeop Han, Hyun Woo Oh, Young Hyun Yoon, and Seung Eun Lee. Micromachines, Jan 2021, pp. 145
The development of the mobile industry brings about the demand for high-performance embedded systems in order to meet the requirements of user-centered applications. Because of the limitation of memory resources, employing compressed data is efficient for an embedded system. However, the workload for data decompression causes an extreme bottleneck for the embedded processor. One of the ways to alleviate the bottleneck is to integrate a hardware accelerator along with the processor, constructing a system-on-chip (SoC) for the embedded system. In this paper, we propose a lossless decompression accelerator for an embedded processor, which supports LZ77 decompression and static Huffman decoding for the inflate algorithm. The accelerator is implemented on a field-programmable gate array (FPGA) to verify the functional suitability and fabricated in a Samsung 65 nm complementary metal oxide semiconductor (CMOS) process. The performance of the accelerator is evaluated with the Canterbury corpus benchmark, achieving a throughput of up to 20.7 MB/s at a 50 MHz system clock frequency.
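The LZ77 half of the inflate algorithm reconstructs output by copying from a sliding window of bytes already produced; the copy must proceed byte by byte so that overlapping matches (distance shorter than length) repeat recent output, which any DEFLATE-compatible decompressor must honor. A minimal sketch with an assumed token format of literal bytes and (distance, length) pairs:

```python
def lz77_decompress(tokens):
    """Decompress a stream of inflate-style tokens.

    Each token is either an int (a literal byte) or a
    (distance, length) back-reference into bytes already produced.
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):        # literal byte
            out.append(tok)
        else:                           # (distance, length) pair
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte copy handles overlap
    return bytes(out)
```

In the full inflate pipeline these tokens come out of the Huffman decoding stage; a hardware accelerator hides the per-byte copy loop behind a dedicated history buffer instead of burdening the processor with it.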
@article{hwang_lossless_2021, title = {Lossless {Decompression} {Accelerator} for {Embedded} {Processor} with {GUI}}, volume = {12}, issn = {2072-666X}, url = {https://www.mdpi.com/2072-666X/12/2/145}, doi = {10.3390/mi12020145}, number = {2}, journal = {Micromachines}, author = {Hwang, Gwan Beom and Cho, Kwon Neung and Han, Chang Yeop and Oh, Hyun Woo and Yoon, Young Hyun and Lee, Seung Eun}, month = jan, year = {2021}, pages = {145}, }
- Vision-based Parking Occupation Detecting with Embedded AI Processor. Kwon Neung Cho, Hyun Woo Oh, and Seung Eun Lee. IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, Jan 2021, pp. 1–2
Recently, as interest in smart parking systems is increasing, various methods for detecting parking occupation are under study. In this paper, we present vision-based parking occupation detection with an embedded AI processor. By employing a fisheye lens camera, multiple parking slot states are identified by one device. We measure the recognition rate of the AI processor in the proposed system and determine the optimized configuration with a software simulator. The highest recognition rate, 94.48%, is measured in the configuration with 64 training data entries of 256 bytes each.
@inproceedings{cho_vision-based_2021, address = {Las Vegas, NV, USA}, title = {Vision-based {Parking} {Occupation} {Detecting} with {Embedded} {AI} {Processor}}, isbn = {978-1-72819-766-1}, url = {https://ieeexplore.ieee.org/document/9427661/}, doi = {10.1109/ICCE50685.2021.9427661}, booktitle = {{IEEE} {International} {Conference} on {Consumer} {Electronics} ({ICCE})}, publisher = {IEEE}, author = {Cho, Kwon Neung and Oh, Hyun Woo and Lee, Seung Eun}, month = jan, year = {2021}, pages = {1--2}, }
2020
- Design of 32-bit Processor for Embedded Systems. Hyun Woo Oh, Kwon Neung Cho, and Seung Eun Lee. International SoC Design Conference (ISOCC), Yeosu, Korea, Oct 2020, pp. 306–307
In this paper, we propose a 32-bit processor for embedded systems. In order to provide small area and low-power operation, we adopt the MIPS instruction set architecture (ISA) for our processor. The processor consists of five pipeline stages to reduce the critical path. In order to resolve data hazards between pipeline stages, we design a data forwarding unit and a stall unit with optimized bubble insertion. The processor is implemented on a field-programmable gate array (FPGA), and we verify the functionality of the processor and measure its performance using the Dhrystone benchmark. The Dhrystone MIPS (DMIPS) is measured at 27.71 at 50 MHz operation.
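The forwarding decision in a classic five-stage pipeline can be stated compactly. The sketch below follows the textbook MIPS forwarding conditions (the newest in-flight result wins, and register 0 is never forwarded); it is a behavioral illustration only, not the paper's exact RTL:

```python
def forward_select(ex_mem_rd, mem_wb_rd, rs,
                   ex_mem_regwrite, mem_wb_regwrite):
    """Select the operand source for register rs in the EX stage:
    prefer the EX/MEM result (newest), then MEM/WB, else the
    register file. Register 0 is hardwired to zero in MIPS and is
    therefore never forwarded.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM/WB"
    return "REGFILE"
```

A load followed immediately by a dependent instruction cannot be resolved by forwarding alone, which is where the abstract's stall unit with optimized bubble insertion comes in.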
@inproceedings{oh_design_2020, address = {Yeosu, Korea}, title = {Design of 32-bit {Processor} for {Embedded} {Systems}}, isbn = {978-1-72818-331-2}, url = {https://ieeexplore.ieee.org/document/9332944/}, doi = {10.1109/ISOCC50952.2020.9332944}, booktitle = {{International} {SoC} {Design} {Conference} ({ISOCC})}, publisher = {IEEE}, author = {Oh, Hyun Woo and Cho, Kwon Neung and Lee, Seung Eun}, month = oct, year = {2020}, pages = {306--307}, }