Chia-Chih Chen | C.-C. Chen

About Me

I am a Deep Learning Product Research Engineer at NVIDIA, specializing in Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs), and Generative AI. My work focuses on bridging the gap between state-of-the-art research and scalable enterprise solutions—specifically in visual document understanding, analytical data visualization, and deep research agents leveraging NVIDIA Nemotron, NIM, and the NeMo framework.

Research Impact

Analyzing scholarly contributions through citations and collaborative reach.

Co-author Network

Interactive graph of co-authors. Legend: Primary / NVIDIA; Advisor / UT; Salesforce; Collaborator.

Research Topics & Methods

Interactive graph of key areas derived from publications and patents. Legend: GenAI / LLM; Vision / OCR; Autonomous.

Recent Updates

NVIDIA Nemotron Nano V2 VL

Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, ... Chia-Chih Chen, et al.

arXiv 2025

Abstract Advances the Nemotron vision-language series for document understanding, long-video comprehension, and reasoning, with improved efficiency and long-context throughput.

NVIDIA Nemotron-Parse 1.1

Kateryna Chumachenko, Amala Sanjay Deshmukh, ... Chia-Chih Chen, et al.

arXiv 2025

Abstract We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances capabilities across general OCR, markdown formatting, and structured table parsing.

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue, Manli Shu, ... Chia-Chih Chen, et al.

ICCV 2025 Findings

Abstract Presented the xGen-MM (BLIP-3) framework and models at ICCV Findings: a comprehensive suite of open large multimodal models designed for scalability and diverse multimodal tasks.

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, ... Chia-Chih Chen, et al.

CVPR 2024

Abstract Accepted to CVPR 2024. We propose a framework that aligns instructional image editing with human preferences using a reward model and diffusion model fine-tuning.

Experience & Education

Deep Learning Product Research Engineer

NVIDIA | Nov 2023 – Present

Research and development in the Enterprise Product Group (EPG), focusing on NVIDIA NIM and the NeMo framework.

Lead Engineer, Data Science

Salesforce AI Research | Palo Alto, CA | Nov 2019 – Nov 2023

Multimodal LLM, Generative AI for graphical layout design and image editing, Document AI, OCR, and Microservices.

Multimodal LLM, Generative AI, OCR, Stable Diffusion

Senior Computer Vision Engineer

NVIDIA | Santa Clara, CA | Dec 2015 – Oct 2019

Autonomous vehicle SDK development.
Computer vision and deep neural network design and training for self-driving cars.

Computer Scientist

Create Technologies Inc | Monterey, CA | May 2014 – Dec 2015

Computer vision, machine learning, deep neural networks, and Android development.

Computer Vision Research Scientist

Sealed Air Corporation | Mountain View, CA | Jan 2012 – Apr 2014

Applied computer vision solutions to industrial problems. Also served as Principal Scientist in April 2014.

Early Career & Research

Research Assistant

University of Texas at Austin | Austin, TX | Aug 2008 – Dec 2011

Development of algorithms for the recognition of human activities and human-vehicle interactions from surveillance videos.

Research Intern

Kitware Inc. | Albany, NY | May 2011 – Jul 2011

Summer Intern

MediaTek | Jun 2008 – Aug 2008

Panorama software development.

Education

Ph.D. in Electrical and Computer Engineering

University of Texas at Austin | Dec 2011

Dissertation: "Recognizing Human Activities from Low-Resolution Videos"

M.S.E. in Electrical and Computer Engineering

University of Texas at Austin | May 2006

B.S. in Electrical and Computer Engineering

National Chiao Tung University | June 2002

Selected Conference Papers

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue, Manli Shu, ... Chia-Chih Chen, et al.

ICCV 2025 Findings

Abstract Introduces BLIP-3, a framework for developing Large Multimodal Models (LMMs). We release 4B and 14B models that demonstrate competitive performance among open-source LMMs, comprehend interleaved image-text inputs, and are rigorously evaluated across single- and multi-image benchmarks.

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu, Chia-Chih Chen, et al.

ECCV 2024

Abstract Reformulates layout generation as a detection problem. LayoutDETR inherits high quality from generative modeling while satisfying content-aware requirements, detecting reasonable locations and scales for multimodal elements in a background image.

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, ... Chia-Chih Chen, et al.

CVPR 2024

Abstract We present a framework to harness human feedback for instructional visual editing. By learning a reward function to capture underlying user preferences and fine-tuning diffusion models, HIVE significantly improves alignment between editing instructions and visual outputs.

View Invariant Human Action Recognition Using Histograms of 3D Joints

Lu Xia, Chia-Chih Chen, et al.

CVPR Workshops 2012

Abstract Presents a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures, demonstrating significant view invariance on 3D action datasets.

Modeling Human Activities as Speech

Chia-Chih Chen, J. K. Aggarwal

CVPR 2011

Abstract Drawing on structural analogies between speech and motion, this paper introduces the "action spectrogram," a novel space-time-frequency representation for human activity recognition. This method characterizes body movements by modeling the likelihood time series of local interest patterns, effectively adapting speech analysis techniques to visual data.

View Full List on Google Scholar

Journal Articles

Detection of Object Abandonment Using Temporal Logic

Medha Bhargava, Chia-Chih Chen, et al.

Machine Vision and Applications, 2009

Abstract Describes a novel framework for a smart threat detection system that uses computer vision to capture, exploit and interpret the temporal flow of events related to the abandonment of an object in crowded environments.

Datasets

VIRAT Video Dataset

Sangmin Oh, Anthony Hoogs, ... Chia-Chih Chen, et al.

CVPR 2011

Abstract A large-scale benchmark dataset for event recognition in surveillance video. Designed to assess the performance of event recognition algorithms in realistic scenes with wide coverage and natural actions.

UT-Tower Dataset

Chia-Chih Chen, et al.

ICPR 2010

Abstract Aerial View Activity Classification Challenge Dataset. Introduced for the SDHA 2010 contest to encourage research in recognition of human activities from low-resolution aerial videos.

Dissertation

Recognizing Human Activities from Low-Resolution Videos

Chia-Chih Chen

Ph.D. Dissertation, UT Austin, 2011

Abstract This dissertation presents a series of approaches to address challenges in recognizing human activities from low-resolution videos, including shadow removal, speech-like activity modeling, and human-vehicle interaction reasoning.

Patents

Systems and methods for multimodal layout designs of digital publications

US 12,536,720 B2

Issued Jan 27, 2026 • Salesforce, Inc.

Ning Yu, Chia-Chih Chen, et al.

Methodologies for automated layout generation and multimodal interface design.

Systems and methods for feedback based instructional visual editing

US 12,494,004 B2

Issued Dec 9, 2025 • Salesforce, Inc.

Shu Zhang, Xinyi Yang, Yihao Feng, Chia-Chih Chen, et al.

Innovations in incorporating human feedback into instructional visual editing pipelines, related to the HIVE framework.

Determining drivable free-space for autonomous vehicles

US 11,537,139 B2

Issued Dec 27, 2022 • NVIDIA Corporation

Mansi Rankawat, Jian Yao, Dong Zhang, Chia-Chih Chen

Real-time detection of lanes and boundaries by autonomous vehicles

US 10,997,433 B2

Issued May 4, 2021 • NVIDIA Corporation

Yifang Xu, Xin Liu, Chia-Chih Chen, et al.

Technical Blogs

NVIDIA Technical Blog

How to Build a Document Processing Pipeline for RAG with Nemotron

A step-by-step guide to building a multimodal document processing pipeline for RAG with Nemotron, covering extraction, embeddings, reranking, and source-grounded answer generation.

NVIDIA Glossary

Multimodal Large Language Models

A comprehensive overview of Multimodal Large Language Models (MLLMs), explaining how these systems ingest and process diverse data types—including text, images, and audio—to enable advanced reasoning and generative capabilities.

NVIDIA Technical Blog

Turn Complex Documents into Usable Data with VLM

Introduction to NVIDIA Nemotron Parse 1.1, a VLM-based solution for high-precision text and table extraction from complex PDF documents to enhance RAG pipelines.

NVIDIA Technical Blog

Unlock Your LLM Coding Potential with StarCoder2

A guide to building custom coding assistants using StarCoder2, covering dataset preparation and fine-tuning techniques to boost developer productivity.

Salesforce AI Research

BannerGen: A Library for Multi-Modality Banner Generation

Introducing BannerGen, an open-source library that leverages generative AI to automate the creation of high-quality, multimodal banner designs.

Video Tutorials & Livestreams

NVIDIA Developer YouTube

Build a Document Intelligence Pipeline With Nemotron RAG | Nemotron Labs

A Nemotron Labs livestream on building a multimodal document-intelligence pipeline with Nemotron RAG, covering extraction, chunking, embeddings, reranking, and grounded answers over complex PDFs.

NVIDIA Developer YouTube

How to Build a Document Processing Pipeline for RAG with Nemotron

A step-by-step video tutorial on building a document-processing pipeline for RAG with Nemotron, including extraction, multimodal embeddings, reranking, and citation-grounded responses.

Notebooks & Open Source Projects

Code & Tutorials

Notebooks

Intelligent Document Processing with Nemotron RAG

Tutorial for building a document processing pipeline for RAG with Nemotron.

Nemotron Parse v1.1 Cookbook

Step-by-step guide for parsing PDFs and documents using the Nemotron Parse model.

StarCoder2 Finetuning

Tutorial and scripts for fine-tuning StarCoder2 on custom datasets using NVIDIA NeMo.

Open Source Projects

BannerGen

A library for multi-modality banner generation, automating layout and asset creation.

LayoutDETR

Official implementation of "LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer".
