Education

Master of Applied Informatics

2021 - 2024

University of Göttingen, Germany

Master Thesis

Improving Portability and Interoperability of Deep-Learning Workloads using ONNX

Investigated ONNX as a unified middleware to bridge fragmented AI ecosystems (e.g., PyTorch, TensorFlow) by evaluating its portability and interoperability across the full AI development lifecycle.

Key Contributions & Technical Details:

  • NLP Model Architecture: Designed a custom PyTorch Transformer Encoder from scratch (utilizing multi-head self-attention and layer normalization) to act as an automated ticket distributor, mapping natural language queries to the most suitable support agent among all candidates.
  • Data Pipeline: Engineered a comprehensive NLP data pipeline processing real-world tickets. Features included dialogue extraction, stopword removal, Gaussian-fitted statistical word-length analysis, Word2Vec embeddings, and multi-strategy data partitioning.
  • Training & Optimization: Systematically explored hyperparameters (learning rate, batch size, attention heads, encoder layers) over 300 epochs on the GWDG cluster using A100 GPUs. The model achieved 65% accuracy in recognizing ticket owners from full conversations, and 22% when predicting solely from the first submitted question.
  • Portability Evaluation: Exported the trained PyTorch model into TorchScript-ONNX and Dynamo-ONNX formats. Evaluated inference and retraining compatibility across diverse ONNX Runtime environments including Python (GPU/CPU), C, C++, Rust, and JavaScript (CPU).
  • Performance Analysis: Achieved a 10x inference speedup using Python with ONNX Runtime on GPUs compared to native PyTorch. Found that JavaScript reached near-C++ speeds using native NPM bindings, while FFI overhead in Rust resulted in over 2x the inference latency.
  • Interoperability & Distributed Learning: Overcame model conversion degradation by dynamically reconfiguring the learning rate via ONNX Runtime's exposed APIs, boosting retraining accuracy to 70%. Proposed a distributed ONNX-based framework enabling advanced architectures (Federated Learning, Hyper-FL, Personalized Learning) to satisfy strict data privacy requirements across heterogeneous devices without ecosystem lock-in.
NLP Transformer PyTorch ONNX ONNX Runtime TorchScript Dynamo A100 CUDA C/C++ Rust JavaScript Federated Learning Slurm

Research Projects

2023.10 - 2024.05

Project 1: Distributed Deep Learning with Golang

Investigated the viability and performance of Golang as a high-performance alternative to Python for scalable, distributed deep learning. Implemented a ResNet50 model from scratch using the Gorgonia package to leverage GPU acceleration. Designed robust deployment pipelines for HPC clusters using Docker and Singularity, and proposed a novel real-time, asynchronous weight-updating mechanism to optimize distributed and federated learning systems.

Key Contributions & Technical Details:

  • Deep Learning Implementation: Built and trained ResNet architectures (ResNet50, ResNet101, ResNet152) from scratch in Golang utilizing the Gorgonia library for automatic differentiation and multidimensional tensor operations with CUDA support.
  • HPC Environment Configuration: Engineered isolated execution environments using Singularity and Docker to compile and deploy native Golang binaries directly to the Emmy HPC cluster, resolving complex, heavyweight Python dependency issues.
  • Performance Benchmarking: Conducted extensive performance evaluations comparing Golang and Python models. Analyzed GPU utilization, batch-size scalability, and time consumption across different execution environments (with and without Singularity).
  • Scalability Design: Laid the architectural groundwork for a complete distributed learning system by preparing the model for MPI-based gradient aggregation and concurrent data loading via Goroutines.
Golang Python Gorgonia CGO CUDA Docker Singularity HPC Distributed Systems MPI Deep Learning

Project 2: HPC Benchmarking & Performance Assessment

Evaluated High-Performance Computing (HPC) system capabilities through comprehensive benchmarking of compute throughput and storage I/O on heterogeneous architectures. The project covered industry-standard benchmarks as well as a deep dive into the MiniBUDE molecular docking application.

Key Contributions & Technical Details:

  • System Benchmarking: Executed standard HPC benchmarks on the GWDG SCC cluster, including IO500 (MPI-based storage performance using ior and mdtest), HPL (dense linear system solve via LU factorization using OpenBLAS and OpenMPI), HPCG, and STREAM (memory bandwidth).
  • MiniBUDE Implementation & Scaling: Conducted an in-depth exploration of the MiniBUDE benchmark (simulating molecular docking with the NDM-1 protein). Configured problem sizes (Small, Medium, Large) to analyze linear scalability and task assignment overheads.
  • CPU Parallelization & Fine-tuning: Analyzed multithreading performance using OpenMP and Julia. Showed the performance impact of matching OpenMP thread counts to CPU cores (achieving linear scaling up to 16 cores), and demonstrated Julia's JIT compilation reaching single-threaded speeds up to 36 GFLOP/s (7x faster than the OpenMP baseline).
  • GPU Acceleration & Offloading: Evaluated GPU acceleration on NVIDIA V100 GPUs using CUDA and OpenCL. Achieved computational speeds exceeding 10 TFLOP/s for large problem sizes in CUDA. Assessed the limitations of compiler dependencies like NVPTX64 for OpenMP target offloading and OpenACC.
HPC GWDG SCC MiniBUDE IO500 HPL HPCG STREAM OpenMP Julia CUDA OpenCL MPI

Project 3: Parallel Deep Learning Pipelines Using Go and MPI

Developed a distributed deep learning framework from scratch using Golang and Message Passing Interface (MPI) to accelerate neural network training across multiple nodes in HPC environments.

Key Contributions & Technical Details:

  • Network Implementation: Built a fully connected neural network entirely in Golang, implementing forward propagation, error backpropagation, L2 loss, and mini-batch gradient descent (MBGD). Utilized the gonum library for matrix operations.
  • MPI Integration: Leveraged CGO to integrate C-based MPI libraries into Golang, allowing for multi-node task parallelism and dynamic process allocation on the GWDG supercomputer cluster.
  • Distributed Training Strategies: Implemented and compared two distributed communication approaches for weight synchronization: a non-collective (point-to-point via send/recv) method and a collective (Allreduce) method.
  • Performance Analysis: Evaluated the speedup on the Iris and Intel Image Classification datasets. Demonstrated near-linear acceleration up to 16 nodes, beyond which MPI communication overhead began to offset parallelization benefits.
Golang MPI Distributed Deep Learning HPC CGO Gonum GWDG

Master of Physics

2016 - 2019

University of Göttingen, Germany

Master Thesis:

Simulated particle collisions at the LHC using MadGraph within the Standard Model and Higgs mechanism framework to explore parameter constraints for dark matter candidates predicted by the Inert Doublet Model.

Bachelor of Applied Informatics

2020 - 2021

University of Göttingen, Germany

Completed core computer science coursework and was admitted directly to the master's program.

Operating Systems Algorithms Data Structures Java C/C++

Bachelor of Physics

2009 - 2013

University of Shihezi, China

Bachelor Thesis:

Extended Maxwell's equations through a new formalization of electron spin to account for magnetic monopoles, offering an alternative explanation for the Lorentz force and the Hall effect.

Professional Experience

AI Engineering

2025.11 - 2026.03

Turing College

  • Built multiple end-to-end AI applications with customized tools and systematic prompt engineering, deployed on Kubernetes, managing the full lifecycle from development to production.
  • Integrated MCP to connect LLMs with external tools and APIs.
  • Built RAG pipelines with distributed Qdrant for hybrid search and context-aware retrieval.
  • Managed short- and long-term memory for user personalization, incorporating Human-in-the-Loop feedback mechanisms.
LLM LangGraph LangChain MCP RAG Qdrant Kubernetes OpenClaw

Competition Project: AgentX

2025.04 - 2025.05

GWDG

Built LynxNLI, an assistant agent that simplifies Linux operations in HPC environments through natural language. Served as core developer, owning the project from framework design to tool implementation.

LLM AI Agent Generative AI LangGraph PydanticAI

Python Developer

2022.10 - 2023.09

GWDG

Built a content management system with a multi-layer security model covering authentication and role-based permission control. Developed backend features for email-based notifications, keyword search, attribute filtering, timed tasks, and REST API endpoints with custom access rules. Owned testing and deployment, covering unit tests, integration tests, and production rollout.

Python Django JavaScript PostgreSQL WSGI Linux REST API

Full-stack Web Developer

2022.02 - 2022.09

Eforsch

Built a full-stack platform for digitizing chemical and biological experiments, funded by the NBank Gründungsstipendium in Niedersachsen. Automated experiment workflows including calculations, report generation, and supply tracking. Managed the full project lifecycle as sole developer, from initial implementation through production deployment.

Vue Golang Django SQL Nginx Docker Compose Cloud server

Robotics Engineer

2019.12 - 2020.05

Mianyang Lunqi Robotics Co., Ltd.

  • Deployed AI-based visual inspection systems for custom industrial automation applications.
  • Collaborated with university research groups to develop solutions for client-specific requirements.

Publications & Patents

Exploration for Distributed Learning Design with Golang
Silin Zhao
Technical Report, Universität Göttingen, April 2024
Advisors: Sadegh Keshtkar and Julian Kunkel
Improving Portability and Interoperability of Deep-Learning-Workloads using ONNX
Silin Zhao
Journal Article, Applied Intelligence (Springer Nature), 2025
Under Review
Growing Domain-Specific LLMs Through Federated Split-Phase Learning: System Design and Analysis
Silin Zhao
Conference Paper, Federated Learning Systems & Applications (FLICS), 2026
Accepted
Patent: ZL2011 1 0004207.2
Silin Zhao
Patent, China National Intellectual Property Administration, 2015
Granted

Certificates

2026 AI Engineering Professional Certificate Turing College
2025 Certified Kubernetes Application Developer (CKAD) Udemy
2025 Performance Analysis of AI and HPC Workloads GWDG
2024 Deep Learning Bootcamp: Building and Deploying AI Models GWDG
2022 Gründungsstipendium NBank Niedersachsen
2021 Certificate of Attendance of GWDG Scientific Compute Cluster GWDG
2019 IBM Data Science Professional Certificate Coursera
2017 PIER Graduate Week 2017 Confirmation of Attendance DESY
2013 Teacher Qualification Certificate (High School Physics) Ministry of Education, PRC

Specialized Skills

Large Language Models

  • LLM deployment with Ollama for local inference across multiple environments
  • OpenClaw and Hermes agent deployment and orchestration in Kubernetes
  • MCP server implementation for local service exposure and multi-server management
  • AI agent skill management with modular tool integration and custom workflows
  • Aichat project: conversational AI interface with extensible plugin architecture
  • RAG pipelines with hybrid search and context-aware retrieval using Qdrant
  • API Integration: experience with leading commercial APIs and local API implementations

High-Performance & Parallel Computing

  • Cluster Administration: Managed compute clusters, ensuring optimal resource allocation, scheduling efficiency, and system reliability for multiple concurrent workloads.
  • Performance Analysis: Performed deep-dive performance profiling to identify system bottlenecks and optimize compute resource utilization.
  • Parallel Computing Development: Engineered high-performance parallel computing solutions leveraging MPI for distributed memory and CUDA for GPU acceleration.
  • Distributed Systems Integration: Designed and implemented scalable distributed learning systems, utilizing Golang and MPI for efficient, low-latency inter-process communication across nodes.

Programming & Tools

Python Golang Rust C/C++ JavaScript PyTorch Django Kubernetes Docker LangChain LangGraph ONNX Slurm MPI Emacs (10+ years)

Languages

English (C1) German (C1) Chinese (Native)