Biography
Enrollment Date: 2011
Graduation Date: 2014
Degree: M.S.
Defense Date: 2013.05.29
Advisors: Yangdong Deng
Department: Institute of Microelectronics, Tsinghua University
Title of Dissertation/Thesis: Typical machine learning algorithms acceleration on GPUs
Abstract:
Machine learning is a technique that makes predictions based on models trained by a series of algorithms. It is widely used in speech recognition and information retrieval, and there is ongoing research in object recognition and image classification. As the problems in machine learning scale up, models and training data keep growing, and training a mainstream deep learning network can take weeks, so accelerating the machine learning process has become a hot research topic in both academia and industry. There are several platform choices for building a machine learning system, such as multi-core CPUs, GPUs, and heterogeneous APUs. Multi-core CPUs outperform the other two platforms on applications with heavy branching and prediction, while general-purpose GPUs are good at data-intensive problems such as linear algebra calculations. Heterogeneous APUs fuse CPU cores with GPU compute units and thus reduce the communication cost introduced by the PCI-e interface. Because the computing patterns of machine learning algorithms are complicated, the first step in accelerating a machine learning application is to choose the appropriate platform. On the other hand, it is not easy to leverage the numerous threads, high data throughput, and high memory bandwidth provided by GPUs to remove the performance bottlenecks in machine learning applications, so the second step is to optimize the most time-consuming kernels in the application.
We demonstrate these two steps of machine learning algorithm acceleration on GPUs with two significant cases: the LambdaMART algorithm in learning to rank and a deep autoencoder trained with the L-BFGS algorithm. We focus on task division and platform selection in the deep autoencoder case, while kernel optimization is presented for the LambdaMART algorithm. The parallel GPU algorithms implemented in this work achieve state-of-the-art performance on the same training data, and the two optimization steps can serve as a reference for AI researchers and parallel programmers.
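To make the platform-selection argument concrete, below is a minimal CUDA sketch of the kind of regular, data-intensive linear-algebra kernel the abstract describes as a good fit for GPUs. It is an illustrative SAXPY (y = a*x + y) written for this page, not code from the thesis; the sizes and names are assumptions chosen for brevity.

// Illustrative CUDA example: an element-wise linear-algebra kernel (SAXPY),
// the pattern where the GPU's many threads and high memory bandwidth pay off.
// Not from the thesis; all parameters here are placeholder assumptions.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // One thread per element: massive parallelism hides memory latency.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Unified memory keeps the sketch short; a tuned implementation would
    // manage host/device transfers explicitly to hide the PCI-e cost the
    // abstract mentions.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}

A branchy, control-flow-heavy workload would not map onto this one-thread-per-element pattern, which is the intuition behind the abstract's first step of matching the algorithm's computing pattern to the platform.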