Who We Are:
We are a rapidly growing embodied AI company revolutionizing human labor. Leveraging cutting-edge robotics and advanced artificial intelligence, we develop transformative technologies that redefine how work is done across multiple industries—empowering businesses to streamline operations, boost productivity, and unlock new possibilities.
Overview:
As a Distributed Systems Engineer, you will design, implement, and optimize scalable systems that power modern AI and machine learning applications. You will work closely with engineering teams to build robust infrastructure, ensuring system reliability and performance.
Your Responsibilities:
Design
- Design and develop distributed systems and tools in Python and C++.
- Write shell scripts, build Docker containers, set up training/inference clusters, and automate the training/inference pipeline.
- Write communication layers and connectors between ML-related microservices and components.
Optimization
- Optimize system performance and scalability.
- Debug and resolve complex distributed system issues.
- Implement fault-tolerant and high-availability features.
- Collaborate with cross-functional teams to integrate distributed systems into larger platforms.
Qualifications:
Education and Experience
- Bachelor’s or Master’s degree in Computer Science or a related field.
- Experience working with asynchronous, parallel ML serving frameworks such as TorchServe, vLLM, LMDeploy, or NVIDIA Triton.
- Experience in designing and deploying distributed systems.
- Experience training ML models and LLMs in distributed settings.
Skills
- Proficiency in Python and PyTorch.
- Proficiency in Bash scripting, building Docker images, and ML orchestration tools such as Kubernetes, KServe, Slurm, and Torch Distributed.
- Proficiency working with cloud-based ML platforms such as AWS SageMaker and GCP Vertex AI, as well as other cloud services such as storage and Docker registries.
- Proficiency working with distributed communication systems such as MPI and NCCL, as well as general-purpose communication systems such as RabbitMQ and MQTT.
- Strong understanding of networking, concurrency, and multithreading.
What We Offer:
- Wellpass (gym membership)
- Free meals at the workplace
- Flexible working hours
- A motivated team and an open corporate culture
- Competitive compensation and excellent career development opportunities