Master of Application Informatics
2021 - 2024
University of Göttingen, Germany
Master Thesis
Improving Portability and Interoperability of Deep-Learning Workloads using ONNX
Investigated ONNX as a unified middleware to bridge fragmented AI ecosystems (e.g., PyTorch, TensorFlow) by evaluating its portability and interoperability across the full AI development lifecycle.
Key Contributions & Technical Details:
- NLP Model Architecture: Designed a custom PyTorch Transformer Encoder from scratch (using multi-head self-attention and layer normalization) to act as an automated ticket distributor, mapping natural-language queries to the most suitable technical supporter among all candidates.
- Data Pipeline: Engineered a comprehensive NLP data pipeline processing real-world tickets. Features included dialogue extraction, stopword removal, Gaussian-fitted statistical word-length analysis, Word2Vec embeddings, and multi-strategy data partitioning.
- Training & Optimization: Systematically explored hyperparameters (learning rate, batch size, attention heads, encoder layers) over 300 epochs on the GWDG cluster using A100 GPUs. The model achieved 65% accuracy in recognizing ticket owners from full conversations, and 22% when predicting solely from the first submitted question.
- Portability Evaluation: Exported the trained PyTorch model into TorchScript-ONNX and Dynamo-ONNX formats. Evaluated inference and retraining compatibility across diverse ONNX Runtime environments including Python (GPU/CPU), C, C++, Rust, and JavaScript (CPU).
- Performance Analysis: Achieved a 10x inference speedup using Python with ONNX Runtime on GPUs compared to native PyTorch. Showed that JavaScript reached near-C++ speeds using native NPM bindings, and identified FFI overhead in Rust that caused over 2x inference latency.
- Interoperability & Distributed Learning: Overcame model conversion degradation by dynamically reconfiguring the learning rate via ONNX Runtime's exposed APIs, boosting retraining accuracy to 70%. Proposed a distributed ONNX-based framework enabling advanced architectures (Federated Learning, Hyper-FL, Personalized Learning) to satisfy strict data privacy requirements across heterogeneous devices without ecosystem lock-in.
Research Projects
2023.10 - 2024.05
Project 1: Distributed Deep Learning with Golang
Investigated the viability and performance of Golang as a high-performance alternative to Python for scalable, distributed deep learning. Implemented a ResNet50 model from scratch using the Gorgonia package to leverage GPU acceleration. Designed robust deployment pipelines for HPC clusters using Docker and Singularity, and proposed a novel real-time, asynchronous weight-updating mechanism to optimize distributed and federated learning systems.
Key Contributions & Technical Details:
- Deep Learning Implementation: Built and trained ResNet architectures (ResNet50, ResNet101, ResNet152) from scratch in Golang using the Gorgonia library for automatic differentiation and multidimensional tensor operations with CUDA support.
- HPC Environment Configuration: Engineered isolated execution environments using Singularity and Docker to compile and deploy native Golang binaries directly to the Emmy HPC cluster, resolving complex, heavy Python dependency issues.
- Performance Benchmarking: Conducted extensive performance evaluations comparing Golang and Python models. Analyzed GPU utilization, batch-size scalability, and runtime across different execution environments (with and without Singularity).
- Scalability Design: Laid the architectural groundwork for a complete distributed learning system by preparing the model for MPI-based gradient aggregation and concurrent data loading via Goroutines.
Project 2: HPC Benchmarking & Performance Assessment
Evaluated High-Performance Computing (HPC) system capabilities through comprehensive benchmarking of compute performance and storage I/O on heterogeneous architectures. The project covered industry-standard benchmarks as well as a deep dive into the MiniBUDE molecular docking application.
Key Contributions & Technical Details:
- System Benchmarking: Executed standard HPC benchmarks on the GWDG SCC cluster, including IO500 (MPI-based storage performance using ior and mdtest), HPL (solving a dense linear system via LU factorization using OpenBLAS and OpenMPI), HPCG, and STREAM (memory bandwidth).
- MiniBUDE Implementation & Scaling: Conducted an in-depth exploration of the MiniBUDE benchmark (simulating molecular docking with the NDM-1 protein). Configured problem sizes (Small, Medium, Large) to analyze linear scalability and task assignment overheads.
- CPU Parallelization & Fine-tuning: Analyzed multithreading performance using OpenMP and Julia. Showed the performance impact of matching OpenMP threads to CPU cores (achieving linear scaling up to 16 cores), and demonstrated Julia's JIT compilation reaching single-threaded speeds of up to 36 GFLOP/s (7x faster than the OpenMP baseline).
- GPU Acceleration & Offloading: Evaluated GPU acceleration on NVIDIA V100 GPUs using CUDA and OpenCL. Achieved computational speeds exceeding 10 TFLOP/s for large problem sizes with CUDA. Assessed the limitations imposed by compiler dependencies such as NVPTX64 for OpenMP target offloading and OpenACC.
Project 3: Parallel Deep Learning Pipelines Using Go and MPI
Developed a distributed deep learning framework from scratch using Golang and Message Passing Interface (MPI) to accelerate neural network training across multiple nodes in HPC environments.
Key Contributions & Technical Details:
- Network Implementation: Built a fully connected neural network entirely in Golang, implementing forward propagation, error backpropagation, L2 loss, and mini-batch gradient descent (MBGD). Utilized the gonum library for matrix operations.
- MPI Integration: Leveraged CGO to integrate C-based MPI libraries into Golang, allowing for multi-node task parallelism and dynamic process allocation on the GWDG supercomputer cluster.
- Distributed Training Strategies: Implemented and compared two distributed communication approaches for weight synchronization: a non-collective (point-to-point via send/recv) method and a collective (Allreduce) method.
- Performance Analysis: Evaluated the speedup on the Iris and Intel Image Classification datasets. Demonstrated near-linear acceleration up to 16 nodes, beyond which MPI communication overhead began to offset parallelization benefits.