About Me
I am a Deep Learning Product Research Engineer at NVIDIA, specializing in Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs), and Generative AI. My work focuses on bridging the gap between state-of-the-art research and scalable enterprise solutions—specifically in visual document understanding, analytical data visualization, and deep research agents leveraging NVIDIA NIM and the NeMo framework.
Previously, I served as a Lead Engineer in Data Science at Salesforce AI Research. There, I trained models and built applications for Document AI, brought a high-accuracy generalized OCR model to general availability (GA), and co-invented systems for ad layout generation and instruction-based image editing. Prior to that, I was a Senior Computer Vision Software Engineer at NVIDIA, where I designed perception DNNs for autonomous driving. My work on lane detection and drivable freespace networks was patented and showcased at major industry events such as GTC and CES.
Earlier in my career, I developed computer vision systems for diverse industrial applications, ranging from restaurant analytics to medical visual applications. I earned my Ph.D. in Computer Vision and Machine Learning from the University of Texas at Austin, advised by Prof. J. K. Aggarwal, where I pioneered research on human activity recognition using 3D ToF depth data as well as aerial low-resolution imagery.
Recent Updates
NVIDIA Nemotron-Parse 1.1
Kateryna Chumachenko, Amala Sanjay Deshmukh, ... Chia-Chih Chen, et al.
BLIP-3: A Family of Open Large Multimodal Models
Le Xue, Manli Shu, ... Chia-Chih Chen, et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, ... Chia-Chih Chen, et al.
Experience & Education
Deep Learning Product Research Engineer
NVIDIA | Nov 2023 – Present
Research and development in the Enterprise Product Group (EPG), focusing on NVIDIA NIM and the NeMo framework.
Lead Engineer, Data Science
Salesforce AI Research | Palo Alto, CA | Nov 2019 – Nov 2023
Multimodal LLM, Generative AI for graphical layout design and image editing, Document AI, OCR, and Microservices.
Senior Computer Vision Engineer
NVIDIA | Santa Clara, CA | Dec 2015 – Oct 2019
- Autonomous vehicle SDK development.
- Computer vision and deep neural network design and training for self-driving cars.
Computer Scientist
Create Technologies Inc | Monterey, CA | May 2014 – Dec 2015
Computer vision, machine learning, deep neural networks, and Android development.
Computer Vision Research Scientist
Sealed Air Corporation | Mountain View, CA | Jan 2012 – Apr 2014
(Also served as Principal Scientist in April 2014). Applied computer vision solutions to industrial problems.
Early Career & Research
Research Assistant
University of Texas at Austin | Austin, TX | Aug 2008 – Dec 2011
Development of algorithms for the recognition of human activities and human-vehicle interactions from surveillance videos.
Research Intern
Kitware Inc. | Albany, NY | May 2011 – Jul 2011
Summer Intern
MediaTek | Jun 2008 – Aug 2008
Panorama software development.
Ph.D. in Electrical and Computer Engineering
University of Texas at Austin | Dec 2011
Dissertation: "Recognizing Human Activities from Low-Resolution Videos"
M.S.E. in Electrical and Computer Engineering
University of Texas at Austin | May 2006
B.S. in Electrical and Computer Engineering
National Chiao Tung University | June 2002
Selected Conference Papers
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue, Manli Shu, ... Chia-Chih Chen, et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, ... Chia-Chih Chen, et al.
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
Ning Yu, Chia-Chih Chen, et al.
View Invariant Human Action Recognition Using Histograms of 3D Joints
Lu Xia, Chia-Chih Chen, et al.
Modeling Human Activities as Speech
Chia-Chih Chen, J. K. Aggarwal
Journal Articles
Detection of Object Abandonment Using Temporal Logic
Medha Bhargava, Chia-Chih Chen, et al.
Datasets
VIRAT Video Dataset
Sangmin Oh, Anthony Hoogs, ... Chia-Chih Chen, et al.
UT-Tower Dataset
Chia-Chih Chen, et al.
Dissertation
Recognizing Human Activities from Low-Resolution Videos
Chia-Chih Chen
Patents & Applications
Determining drivable free-space for autonomous vehicles
US 11,537,139 B2 • Issued Dec 27, 2022 • NVIDIA Corporation
Mansi Rankawat, Jian Yao, Dong Zhang, Chia-Chih Chen
Real-time detection of lanes and boundaries by autonomous vehicles
US 10,997,433 B2 • Issued May 4, 2021 • NVIDIA Corporation
Yifang Xu, Xin Liu, Chia-Chih Chen, et al.
Systems and methods for feedback based instructional visual editing
Pending Application • Filed Jul 1, 2023 • Application No. 70689.273US01
Ning Yu, Chia-Chih Chen, et al.
Innovations in incorporating human feedback into instructional visual editing pipelines, related to the HIVE framework.
Systems and methods for digital publication interface generation
Pending Application • Filed Dec 1, 2022 • Application No. 104383 US
Shu Zhang, Xinyi Yang, Chia-Chih Chen, et al.
Methodologies for automated layout generation and multimodal interface design.
Technical Blogs
Multimodal Large Language Models
A comprehensive overview of Multimodal Large Language Models (MLLMs), explaining how these systems ingest and process diverse data types—including text, images, and audio—to enable advanced reasoning and generative capabilities.
Turn Complex Documents into Usable Data with VLM
Introduction to NVIDIA Nemotron Parse 1.1, a VLM-based solution for high-precision text and table extraction from complex PDF documents to enhance RAG pipelines.
Unlock Your LLM Coding Potential with StarCoder2
A guide to building custom coding assistants using StarCoder2, covering dataset preparation and fine-tuning techniques to boost developer productivity.
BannerGen: A Library for Multi-Modality Banner Generation
Introducing BannerGen, an open-source library that leverages generative AI to automate the creation of high-quality, multimodal banner designs.
Notebooks & Open Source Projects
Code & Tutorials
Notebooks
Step-by-step guide for parsing PDFs and documents using the Nemotron Parse model.
Tutorial and scripts for fine-tuning StarCoder2 on custom datasets using NVIDIA NeMo.
Open Source Projects
A library for multi-modality banner generation, automating layout and asset creation.
Official implementation of "LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer".