© 2014 by Chih-Yao Ma

Last update: Feb. 2018.



My name is Chih-Yao Ma. I am an ECE Ph.D. student at Georgia Tech. 

My research interests lie at the intersection of vision and language. In recent years, I have focused primarily on research at the intersection of computer vision, natural language processing, and temporal reasoning. I have conducted research on large-scale video classification, fine-grained human action recognition, relational reasoning for video understanding, visually grounded video captioning, and vision-and-language navigation agents.


Georgia Institute of Technology, US                       2014 - Present

M.S./Ph.D. in School of Electrical and Computer Engineering 

National Chiao Tung University, Taiwan                     2010 - 2011

M.S. in College of Electrical and Computer Engineering

National Chiao Tung University, Taiwan                     2006 - 2010

B.S. in College of Electrical and Computer Engineering


  • High-Tech Talent Scholarship, Ministry of Education, Taiwan

    • Annual scholarship totaling USD 126,000

  • Academic Achievement Award,  Institute of Electro-Optical Engineering, NCTU

    • Ranked 2nd of 60 students

     Work Experience

Research Intern                                        2018 Summer

Salesforce Research

Research Intern                                  2017 Summer/Fall

Machine Learning @ NEC Labs

Graduate Research Assistant               2014 - Present

OLIVES Lab @ Georgia Tech

Research Assistant                                              2012 - 2014

CommLab @ NCTU

     Areas of Interest


  • Deep Learning

  • Computer Vision

  • Machine Learning

  • Video Understanding

     Research Projects

Grounded Objects and Interactions for Video Captioning                      May 2017 to Dec. 2017


  • Dynamically and progressively discovered higher-order object interactions as the basis for video captioning, implemented in PyTorch.

  • Achieved state-of-the-art performance on the large-scale video captioning dataset ActivityNet Captions.

Long-term Video Classification in YouTube-8M [Poster]                            Jan. 2017 to May 2017


  • Implemented and adapted various RNNs and memory-augmented neural networks (LayerNorm, RHN, Hierarchical RNN, NAS, and DNC) in TensorFlow.

  • Benchmarked their accuracy and speed in modeling long-term video content.

Higher-order Object Interactions for Video Understanding                    May 2017 to Dec. 2017


  • Proposed a generic recurrent higher-order object interactions module for video understanding problems, implemented in PyTorch.

  • Achieved state-of-the-art performance on the large-scale action recognition dataset Kinetics.

Activity Recognition with RNN and Temporal-ConvNet [GitHub]          Jan. 2016 to Mar. 2017


  • Demonstrated a strong two-stream ConvNet baseline using ResNet-101.

  • Proposed two networks to integrate spatiotemporal information: a temporal segment RNN and an Inception-style Temporal-ConvNet.

  • Achieved state-of-the-art performance on UCF101 (94%) and HMDB51 (69%), implemented in Torch.

Partially Occluded Object Tracking with RGB-D Camera Network      Nov. 2014 to Dec. 2016


  • Cooperated with Walmart and SoftWear in developing an Over-Head Vision System for closed-loop control in the sewing industry.

  • Developed an approach based on color histograms and frequency-domain analysis to track multiple partially occluded objects using a Kinect depth-sensor network.

     Technical Skills


  • Deep Learning

    • Programming Languages: Python, Lua, C/C++

    • Frameworks: PyTorch, TensorFlow, Torch, MXNet, Caffe

  • OS

    • Linux, macOS, Windows

  • Engineering Software

    • MATLAB, Mathematica, Maple


  • Software Tools

    • E-Prime, 3ds Max, LightTool

  • Equipment

    • Tobii eye-tracker, Conoscope

  • Typesetting

    • LaTeX, Microsoft Office


     Languages

Chinese - Native

English - Fluent

Self-Monitoring Visual-Textual Co-Grounded Navigation Agent           May 2018 to Sept. 2018


  • Introduced a self-monitoring agent consisting of a visual-textual co-grounding module and a progress monitor, implemented in PyTorch.

  • Set a new state of the art on the Vision-and-Language Navigation task (8% absolute success rate improvement).

The Regretful Navigation Agent                                                                     Sept. 2018 to Nov. 2018


  • Equipped a navigation agent with a Regret Module that decides when to roll back or move forward, implemented in PyTorch.

  • Proposed a Progress Marker that allows the agent to access the progress estimate for each navigable direction.

  • Set a new state of the art on the Vision-and-Language Navigation task (5% SR ↑ and 8% SPL ↑).

Character Recognition in Natural Images [Poster]                                    Nov. 2014 to May 2015


  • Proposed three types of hand-crafted features and unsupervised feature extraction using k-means clustering for character recognition.

  • Trained k-NN, SVM, and Random Forest models on both hand-crafted and learned features.

  • Ranked 9th in the Kaggle competition.

Learning-based Saliency Model with Depth Information                       Feb. 2013 to Aug. 2013


  • Utilized high-, mid-, and low-level features, together with depth features, to investigate how humans view the content of different images.

  • Proposed a machine-learning-based saliency model for 3D content that outperformed state-of-the-art approaches on multiple datasets.

Eye Fixation Database for 3D Image Saliency Detection                         Dec. 2012 to Jun. 2013


  • Designed an eye-tracking experiment to collect data with E-Prime and a Tobii eye-tracker.

  • Established and released an eye-tracking database for 3D images.

  • Analyzed human viewing behavior when watching 3D content.

Multi-Zone Digital Crosstalk Reduction Method for 3D System            Nov. 2010 to Jun. 2011


  • Developed an algorithm to reduce crosstalk by utilizing the inherent structure of patterned retarder 3D display.

Simulation Platform for Patterned Retarder 3D Display                         Jul. 2010 to Feb. 2011


  • Analyzed, measured, and evaluated the crosstalk of patterned retarder 3D displays.

  • Established a simulation platform for patterned retarder 3D displays to predict the light profile under different fabrication parameters.

Crosstalk Suppression by Image Processing in 3D Display                      Feb. 2009 to Jun. 2010


  • Proposed a novel crosstalk reduction method without using extra hardware components.

  • Successfully suppressed crosstalk on both stereoscopic and auto-stereoscopic 3D displays and drastically improved the user viewing experience.

