SageMaker HyperPod Recipes: Transforming Foundation Model Training for Data Scientists
Amazon Web Services (AWS) has announced the public release of SageMaker HyperPod recipes, a capability designed to help data scientists and developers of all skill levels train and fine-tune foundation models (FMs) efficiently. For those unfamiliar with the term, foundation models are large-scale AI models pre-trained on vast datasets that can be adapted to a wide range of specific tasks.
SageMaker HyperPod recipes provide optimized, ready-to-use configurations for training and fine-tuning widely used foundation models, including Llama 3.1 405B, Llama 3.2 90B, and Mixtral 8x22B, all of which have been well received in the AI community for their versatility and performance.
The Introduction of SageMaker HyperPod
AWS first unveiled SageMaker HyperPod at the 2023 re:Invent conference. The infrastructure is designed to reduce the time required to train foundation models by up to 40 percent by distributing workloads across a large pool of computing resources, potentially exceeding a thousand nodes working in tandem. This distribution is facilitated through preconfigured libraries built for scalable training.
SageMaker HyperPod simplifies the process of identifying necessary computing resources, planning optimal training strategies, and executing training tasks across various capacities, all while considering resource availability. This capability is particularly beneficial for organizations seeking to maximize the efficiency of their AI training operations.
How SageMaker HyperPod Recipes Simplify Model Training
The SageMaker HyperPod recipes are meticulously crafted by AWS to eliminate the often labor-intensive process of experimenting with different model setups. These recipes effectively automate numerous critical stages of the training process, including:
- Loading and managing datasets.
- Applying distributed training methodologies.
- Automating checkpoints for quick recovery from errors.
- Overseeing the complete training cycle.
By automating these tasks, the recipes free up valuable time for data scientists, allowing them to focus on refining their models rather than getting bogged down in complex configurations.
For users aiming to optimize further, SageMaker HyperPod recipes enable seamless transitions between GPU and Trainium-based instances, enhancing training performance while minimizing costs. Whether in development or production environments, these recipes can be run effortlessly within SageMaker HyperPod or through SageMaker training jobs.
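As an illustration, the minimal sketch below (using the SageMaker Python SDK pattern shown later in this post) switches the same recipe-driven job between GPU and Trainium capacity by changing only the instance type and recipe name. The Trainium recipe identifier is an illustrative assumption; consult the recipe repository for the exact recipes available for each accelerator family.

```python
from sagemaker.pytorch import PyTorch

use_trainium = False  # flip to switch between accelerator families

estimator = PyTorch(
    role="<role>",  # placeholder for your SageMaker execution role
    # GPU (P5) versus Trainium (Trn1) capacity
    instance_type="ml.trn1.32xlarge" if use_trainium else "ml.p5.48xlarge",
    instance_count=1,
    # GPU recipe versus an equivalent Trainium recipe (the trn1 name is an assumption)
    training_recipe=(
        "fine-tuning/llama/hf_llama3_70b_seq8k_trn1_fine_tuning"
        if use_trainium
        else "fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora"
    ),
)
```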
Getting Started with SageMaker HyperPod Recipes
To explore the potential of SageMaker HyperPod recipes, users can visit the dedicated GitHub repository. This repository contains a variety of training recipes for popular foundation models, providing a comprehensive starting point for model training.
Once users have accessed the repository, they only need to make minor adjustments to the recipe parameters to specify the instance type and dataset location. Executing the recipe then comes down to running a single command, yielding state-of-the-art training performance.
The configuration process involves editing a file named `config.yaml` to define the model type and cluster setup, which tailors the training run to specific needs:

```bash
$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml
```
Comprehensive Support for Training Environments
The recipes are versatile, supporting a range of environments: SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and standard SageMaker training jobs. This flexibility allows users to choose the setup best suited to their requirements.
For instance, when using SageMaker HyperPod with a Slurm orchestrator, users can specify detailed configurations such as the Meta Llama 3.1 405B language model, instance type (e.g., ml.p5.48xlarge), and data storage locations.
```yaml
defaults:
  - cluster: slurm                                              # Options: slurm / k8s / sm_jobs
  - recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora   # Model to train

debug: False                      # Enable debugging if needed
instance_type: ml.p5.48xlarge     # Choose a supported instance type
base_results_dir:                 # Designate storage for results, logs, etc.
```
Users can further modify model-specific settings within the YAML file, adjusting parameters such as the number of accelerator devices, training precision, and logging options for monitoring through TensorBoard.
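One way to script such adjustments is sketched below. This is not the official workflow, and the key names used (trainer devices, precision, and the TensorBoard logger flag) are assumptions to verify against the actual recipe file; the idea is simply to load the recipe YAML, override a few fields, and write it back before launching.

```python
# Minimal sketch: programmatically tweak model-specific settings in a recipe YAML.
# The recipe path and the key names below are illustrative assumptions.
import yaml

recipe_path = "recipes_collection/recipes/fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora.yaml"

with open(recipe_path) as f:
    recipe = yaml.safe_load(f)

recipe["trainer"]["devices"] = 8                              # accelerators per instance (assumed key)
recipe["trainer"]["precision"] = "bf16"                       # training precision (assumed key)
recipe["exp_manager"]["create_tensorboard_logger"] = True     # TensorBoard logging (assumed key)

with open(recipe_path, "w") as f:
    yaml.safe_dump(recipe, f)
```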
Executing SageMaker HyperPod Recipes
To run a recipe using SageMaker HyperPod with Slurm, users must first set up a SageMaker HyperPod cluster as per the provided instructions. After connecting to the head node and accessing the Slurm controller, the edited recipe can be copied for execution. A helper file generates a Slurm submission script, allowing for a dry run before commencing the actual training.
```bash
$ python3 main.py --config-path recipes_collection --config-name=config
```
Post-training, the model is automatically saved to the designated data location, ensuring easy access for further use or analysis.
When using SageMaker HyperPod with Amazon EKS, users clone the recipe, install necessary requirements, and edit the recipe configuration on their local machine. Establishing a connection between the local machine and the EKS cluster is essential before utilizing the HyperPod Command Line Interface (CLI) to execute the recipe.
```bash
$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
    --persistent-volume-claims fsx-claim:data \
    --override-parameters \
    '{
      "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
      "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
      "cluster": "k8s",
      "cluster_type": "k8s",
      "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
      "recipes.model.data.train_dir": "<your_train_data_dir>",
      "recipes.model.data.val_dir": "<your_val_data_dir>"
    }'
```
Alternatively, SageMaker training jobs can be run using the SageMaker Python SDK. This approach allows for the execution of PyTorch training scripts while incorporating custom training recipes.
```python
from sagemaker.pytorch import PyTorch

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
```
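With the estimator defined, launching the job is a single call to fit(). In the sketch below, the channel names "train" and "val" are assumed to correspond to the /opt/ml/input/data/train and /opt/ml/input/data/val paths referenced in recipe_overrides, and the S3 URIs are placeholders.

```python
# Launch the recipe-driven training job; channel names map to /opt/ml/input/data/<channel>.
pytorch_estimator.fit(
    inputs={
        "train": "<s3_uri_for_training_data>",   # mounted at /opt/ml/input/data/train
        "val": "<s3_uri_for_validation_data>",   # mounted at /opt/ml/input/data/val
    },
    wait=True,
)
```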
Throughout the training process, model checkpoints are automatically stored in Amazon Simple Storage Service (Amazon S3). This feature ensures swift recovery from any training interruptions or instance restarts.
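If you want to inspect what has been synced before resuming an interrupted run, a minimal sketch along the following lines lists the checkpoint objects in Amazon S3. The bucket name and key prefix are placeholders, and the layout shown is an assumption to match against your own training job's checkpoint configuration.

```python
# List checkpoint objects synced to Amazon S3 (bucket and prefix are placeholders).
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="<your-bucket>",
    Prefix="<your-job-name>/checkpoints/",  # assumed layout; match your checkpoint location
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```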
Availability and Further Resources
The SageMaker HyperPod recipes are now accessible via the SageMaker HyperPod recipes GitHub repository. For those interested in diving deeper, additional information can be found on the SageMaker HyperPod product page and within the Amazon SageMaker AI Developer Guide.
AWS encourages users to experiment with these recipes and provide feedback through their standard AWS Support channels or via AWS re:Post for SageMaker.
In conclusion, SageMaker HyperPod recipes offer a transformative approach to model training, making it more accessible and efficient for developers and data scientists worldwide. By leveraging this tool, users can significantly enhance their AI capabilities while optimizing resource utilization.