Senior ML Infrastructure Engineer

(m/f/d)

CLUSTER MANAGEMENT • CONTAINERIZATION & ORCHESTRATION • INFRASTRUCTURE OPTIMIZATION
You will own the hardware and software stack that enables our scientists to make breakthrough discoveries.
Frankfurt am Main, Germany (On-site)

The Role

As an ML Infrastructure Engineer, you will own the hardware and software stack that enables our scientists to simulate brain dynamics. You will bridge the gap between bare-metal hardware and high-level JAX code, ensuring our researchers have the compute power and stability required to push the boundaries of AI.

Key Responsibilities:

  • Cluster Management: Manage, maintain, and optimize our local high-performance compute cluster (Linux-based, NVIDIA GPUs). You are the owner of the hardware environment.
  • Containerization & Orchestration: Design and manage robust containerized environments (Docker/Kubernetes) to ensure reproducible and scalable research workflows.
  • Infrastructure Optimization: Maintain and evolve the core ML software infrastructure (Python/JAX codebase), focusing on efficiency, reproducibility, and scalability.
  • Research Operations (MLOps): Execute and monitor large-scale model training and inference runs in tight cooperation with research scientists.
  • Technical Support: Provide hands-on hardware and software support to the research team, troubleshooting bottlenecks in the research workflow.

Your Profile

We are seeking technically proficient engineers with 5+ years of industry experience who love Linux and want to apply their skills to scientific discovery.

Essential Technical Requirements:

  • Education: M.Sc. in Computer Science, Engineering, Physics, or equivalent industry experience. A Ph.D. is a plus.
  • Experience: 5+ years of work experience with a proven track record.
  • Linux Mastery: Deep expertise in Linux administration is non-negotiable. You must be comfortable managing clusters, users, and bare-metal hardware, and fluent in shell scripting and hardware configuration.
  • Container Administration: Proven production experience with Docker and/or Kubernetes is required. You know how to orchestrate complex workloads efficiently.
  • ML Frameworks: Strong experience with Python and deep learning frameworks, specifically JAX and PyTorch.
  • Bonus: Prior experience specifically in ML Infrastructure administration (e.g., Slurm, Docker/Kubernetes for ML).
  • Bonus: Proven track record of Open Source contributions or personal software projects.
  • Bonus: Experience in computational modeling or neuroscience (understanding the "why" behind the code).

Soft Skills:

  • Goal-Driven and Proactive: Strong self-management skills with the ability to take ownership of the infrastructure stack.
  • Collaborative Mindset: You enjoy enabling others to succeed.
  • Communication: Excellent written and verbal communication skills in English. Knowledge of German is a plus, but not required.