Title
AWS re:Invent 2023 - Netflix Maestro: Orchestrating scaled data & ML workflows in the cloud (NFX308)
Summary
- Netflix Maestro is a powerful workflow orchestrator used internally at Netflix to manage and automate ETL pipelines and machine learning workflows.
 - Maestro is a fully managed service that offers workflow-as-a-service to thousands of Netflix users, with an emphasis on reliability and scalability.
 - It comprises a workflow engine, an alerting service, error classification services, and a user interface with templates and a domain-specific language for easy workflow definition.
 - Maestro supports a variety of use cases, including data processing, model training, A/B testing, and more.
 - It was built in-house due to the lack of existing solutions that could handle Netflix's scale and variety of workflows.
 - Maestro's architecture includes an API gateway, a core workflow engine with versioning and triggering support, and integration with downstream services via Kafka.
 - The Maestro DSL (Domain-Specific Language) is available in YAML, Python, and Java, making workflow definitions readable, reproducible, and debuggable.
 - Maestro supports parameterized workflows with features like conditional branching and sub-workflows, as well as dynamic code injection for custom logic.
 - It is extensible, allowing users to create new step types and bring their own compute resources.
 - Workflows in Maestro are executed efficiently and reliably, with each job running in isolation in a Docker container and notebooks executed via Papermill.
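
The parameterization, conditional branching, and sub-workflow ideas in the summary can be illustrated with a small sketch. This is not Maestro's DSL or API, which the summary does not show; `Step`, `Workflow`, `as_step`, and the example workflow names are hypothetical names chosen only to make the concepts concrete.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Params = Dict[str, Any]


@dataclass
class Step:
    """One unit of work in a hypothetical workflow (not Maestro's real step type)."""
    name: str
    run: Callable[[Params], Any]
    # Optional predicate: the step runs only when it returns True.
    condition: Optional[Callable[[Params], bool]] = None


@dataclass
class Workflow:
    """A parameterized, ordered list of steps (illustrative only)."""
    name: str
    steps: List[Step] = field(default_factory=list)
    defaults: Params = field(default_factory=dict)

    def execute(self, **overrides: Any) -> Dict[str, Any]:
        # Caller-supplied parameters take precedence over workflow defaults.
        params = {**self.defaults, **overrides}
        results: Dict[str, Any] = {}
        for step in self.steps:
            # Conditional branching: skip the step when its predicate is False.
            if step.condition is not None and not step.condition(params):
                results[step.name] = "SKIPPED"
                continue
            results[step.name] = step.run(params)
        return results

    def as_step(
        self, name: str, condition: Optional[Callable[[Params], bool]] = None
    ) -> Step:
        # Sub-workflow: expose this workflow as a single step of a parent,
        # passing the parent's parameters down to it.
        return Step(name=name, run=lambda p: self.execute(**p), condition=condition)


# Hypothetical child workflow: "trains" a model for a region.
train = Workflow(
    name="train_model",
    defaults={"region": "us"},
    steps=[Step("fit", lambda p: f"model for {p['region']}")],
)

# Hypothetical parent workflow: always extracts, retrains only on request.
pipeline = Workflow(
    name="daily_pipeline",
    defaults={"region": "us", "retrain": False},
    steps=[
        Step("extract", lambda p: f"rows from {p['region']}"),
        train.as_step("train", condition=lambda p: p["retrain"]),
    ],
)

default_run = pipeline.execute()
retrain_run = pipeline.execute(region="eu", retrain=True)
```

With the defaults, the `train` step is skipped because `retrain` is false; passing `region="eu", retrain=True` runs the sub-workflow with the parent's parameters. A production orchestrator would add what this sketch leaves out, such as versioning, triggers, retries, and isolated execution.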
 
Insights
- Netflix's decision to build Maestro in-house highlights the unique challenges faced by large-scale data-driven companies and the limitations of existing workflow orchestration tools.
 - Maestro's design emphasizes user-friendliness and flexibility, catering to a diverse user base with different technical backgrounds and preferences.
 - The use of domain-specific languages and parameterization in Maestro simplifies the process of defining complex workflows, making it accessible to both engineers and non-engineers.
 - The integration of Maestro with other Netflix tools like Metaflow suggests a cohesive ecosystem for data and ML operations at Netflix.
 - Maestro's ability to handle spiky and uneven loads, with tens of thousands of workflows and millions of jobs per day, demonstrates its robustness and the importance of scalability in workflow orchestration.
 - Presenting Maestro at AWS re:Invent 2023 indicates Netflix's willingness to share its internal tools and practices with the broader tech community, which may influence the development of similar tools in the industry.