C.-C. Chen

About Me

I am a Deep Learning Product Research Engineer at NVIDIA, specializing in Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs), and Generative AI. My work focuses on bridging the gap between state-of-the-art research and scalable enterprise solutions—specifically in visual document understanding, analytical data visualization, and deep research agents leveraging NVIDIA NIM and the NeMo framework.

Research Impact

Analyzing scholarly contributions through citations and collaborative reach.

Co-author Network

Interactive co-author graph, grouped by affiliation: primary (NVIDIA), advisor (UT), Salesforce, and other collaborators. Nodes link to joint papers below.

Research Topics & Methods

Interactive map of key research areas derived from publications and patents: GenAI / LLM, Vision / OCR, and Autonomous. Topics link to the corresponding papers.

Recent Updates

NVIDIA Nemotron-Parse 1.1

Kateryna Chumachenko, Amala Sanjay Deshmukh, ... Chia-Chih Chen, et al.

arXiv 2025
Abstract: We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances capabilities across general OCR, markdown formatting, and structured table parsing.

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue, Manli Shu, ... Chia-Chih Chen, et al.

ICCV 2025 Findings
Abstract: We present the xGen-MM (BLIP-3) framework and models, a comprehensive suite of open large multimodal models designed for scalability and diverse multimodal tasks.

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, ... Chia-Chih Chen, et al.

CVPR 2024
Abstract: We propose a novel framework to align instructional image editing with human preferences using a reward model and diffusion model fine-tuning.

Experience & Education

Deep Learning Product Research Engineer

NVIDIA | Nov 2023 – Present

Research and development in the Enterprise Product Group (EPG), focusing on NVIDIA NIM and the NeMo framework.

Lead Engineer, Data Science

Salesforce AI Research | Palo Alto, CA | Nov 2019 – Nov 2023

Multimodal LLM, Generative AI for graphical layout design and image editing, Document AI, OCR, and Microservices.


Senior Computer Vision Engineer

NVIDIA | Santa Clara, CA | Dec 2015 – Oct 2019

  • Autonomous vehicle SDK development.
  • Computer vision and deep neural network design and training for self-driving cars.

Computer Scientist

Create Technologies Inc | Monterey, CA | May 2014 – Dec 2015

Computer vision, machine learning, deep neural networks, and Android development.

Computer Vision Research Scientist

Sealed Air Corporation | Mountain View, CA | Jan 2012 – Apr 2014

Served as Principal Scientist in April 2014. Applied computer vision solutions to industrial problems.

Early Career & Research

Research Assistant

University of Texas at Austin | Austin, TX | Aug 2008 – Dec 2011

Development of algorithms for the recognition of human activities and human-vehicle interactions from surveillance videos.

Research Intern

Kitware Inc. | Albany, NY | May 2011 – Jul 2011

Summer Intern

MediaTek | Jun 2008 – Aug 2008

Panorama software development.

Education

Ph.D. in Electrical and Computer Engineering

University of Texas at Austin | Dec 2011

Dissertation: "Recognizing Human Activities from Low-Resolution Videos"

M.S.E. in Electrical and Computer Engineering

University of Texas at Austin | May 2006

B.S. in Electrical and Computer Engineering

National Chiao Tung University | Jun 2002

Selected Conference Papers

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue, Manli Shu, ... Chia-Chih Chen, et al.

ICCV 2025 Findings
Abstract: Introduces BLIP-3, a framework for developing Large Multimodal Models (LMMs). We release 4B and 14B models that demonstrate competitive performance among open-source LMMs, comprehend interleaved image-text inputs, and are rigorously evaluated across single- and multi-image benchmarks.

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, ... Chia-Chih Chen, et al.

CVPR 2024
Abstract: We present a framework to harness human feedback for instructional visual editing. By learning a reward function to capture underlying user preferences and fine-tuning diffusion models, HIVE significantly improves alignment between editing instructions and visual outputs.

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu, Chia-Chih Chen, et al.

ECCV 2024
Abstract: Reformulates layout generation as a detection problem. LayoutDETR inherits high quality from generative modeling while satisfying content-aware requirements, detecting reasonable locations and scales for multimodal elements in a background image.

View Invariant Human Action Recognition Using Histograms of 3D Joints

Lu Xia, Chia-Chih Chen, et al.

CVPR Workshops 2012
Abstract: Presents a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures, demonstrating significant view invariance on 3D action datasets.

Modeling Human Activities as Speech

Chia-Chih Chen, J. K. Aggarwal

CVPR 2011
Abstract: Drawing on structural analogies between speech and motion, this paper introduces the "action spectrogram," a novel space-time-frequency representation for human activity recognition. This method characterizes body movements by modeling the likelihood time series of local interest patterns, effectively adapting speech analysis techniques to visual data.

Journal Articles

Detection of Object Abandonment Using Temporal Logic

Medha Bhargava, Chia-Chih Chen, et al.

Machine Vision and Applications, 2009
Abstract: Describes a novel framework for a smart threat detection system that uses computer vision to capture, exploit and interpret the temporal flow of events related to the abandonment of an object in crowded environments.

Datasets

VIRAT Video Dataset

Sangmin Oh, Anthony Hoogs, ... Chia-Chih Chen, et al.

CVPR 2011
Abstract: A large-scale benchmark dataset for event recognition in surveillance video. Designed to assess the performance of event recognition algorithms in realistic scenes with wide coverage and natural actions.

UT-Tower Dataset

Chia-Chih Chen, et al.

ICPR 2010
Abstract: Aerial View Activity Classification Challenge Dataset. Introduced for the SDHA 2010 contest to encourage research in recognition of human activities from low-resolution aerial videos.

Dissertation

Recognizing Human Activities from Low-Resolution Videos

Chia-Chih Chen

Ph.D. Dissertation, UT Austin, 2011
Abstract: This dissertation presents a series of approaches to address challenges in recognizing human activities from low-resolution videos, including shadow removal, speech-like activity modeling, and human-vehicle interaction reasoning.

Patents & Applications

Issued Dec 27, 2022 • NVIDIA Corporation

Mansi Rankawat, Jian Yao, Dong Zhang, Chia-Chih Chen

Issued May 4, 2021 • NVIDIA Corporation

Yifang Xu, Xin Liu, Chia-Chih Chen, et al.

Filed Jul 1, 2023 • Application No. 70689.273US01

Ning Yu, Chia-Chih Chen, et al.

Methodologies for automated layout generation and multimodal interface design, related to the LayoutDETR framework.

Filed Dec 1, 2022 • Application No. 104383 US

Shu Zhang, Xinyi Yang, Chia-Chih Chen, et al.

Innovations in incorporating human feedback into instructional visual editing pipelines, related to the HIVE framework.


Notebooks & Open Source Projects


Notebooks

Nemotron Parse v1.1 Cookbook

Step-by-step guide for parsing PDFs and documents using the Nemotron Parse model.
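The cookbook's parsing flow can be sketched as an OpenAI-style request payload sent to a NIM endpoint. Everything concrete below (the endpoint URL, model identifier, and image-embedding convention) is an illustrative assumption; the actual values come from the cookbook itself.

```python
import base64
import json

# NOTE: endpoint URL, model name, and the image-embedding convention are
# assumptions for illustration only -- consult the cookbook for real values.
NIM_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
MODEL = "nvidia/nemotron-parse-1.1"


def build_parse_request(page_png: bytes) -> dict:
    """Build an OpenAI-style chat payload asking the model to parse one
    rendered document page (PNG bytes) into markdown with tables preserved."""
    b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": (
                "Parse this page into markdown, preserving tables. "
                f'<img src="data:image/png;base64,{b64}" />'
            ),
        }],
        "max_tokens": 2048,
    }


# Placeholder bytes stand in for a real rendered PDF page.
payload = build_parse_request(b"\x89PNG placeholder")
print(json.dumps(payload)[:80])
```

In practice the payload would be POSTed to the endpoint with an API key, one page at a time, and the markdown responses concatenated per document.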

StarCoder2 Finetuning

Tutorial and scripts for fine-tuning StarCoder2 on custom datasets using NVIDIA NeMo.
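As a rough illustration of the data-preparation step that precedes any such fine-tuning run, the sketch below serializes (instruction, completion) pairs into JSONL. The "input"/"output" field names are assumptions; the actual NeMo tutorial config may expect different keys.

```python
import json

# Sketch of fine-tuning data preparation: one JSON object per line (JSONL).
# Field names "input"/"output" are illustrative assumptions -- match them to
# whatever keys the training config actually expects.
def to_jsonl_records(pairs):
    return [
        json.dumps({"input": instruction, "output": completion},
                   ensure_ascii=False)
        for instruction, completion in pairs
    ]


pairs = [
    ("Write a function that reverses a string.",
     "def reverse(s: str) -> str:\n    return s[::-1]"),
]
lines = to_jsonl_records(pairs)
print(lines[0])
```

The resulting lines would typically be written to `train.jsonl` and pointed to from the fine-tuning script's dataset configuration.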

Open Source Projects

BannerGen

A library for multi-modality banner generation, automating layout and asset creation.

LayoutDETR

Official implementation of "LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer".