SageMaker is one of the central AWS services for data scientists and has attracted a lot of attention over the last few years. While many data scientists are learning it, the SageMaker team is constantly shipping new features and capabilities. SageMaker provides tools for both production and experimentation environments.
When it comes to production, SageMaker benefits from the power of dockerization to create isolated, scalable compute machines and workflows. If you are not familiar with these concepts, you can learn them from the AWS documentation for training and processing jobs.
Passing data to these containers is one of the challenging steps of productionizing an ML application. In this blog post, I am going to break down in detail all the paths and possible ways to send data to, or receive data from, a training or processing job container.
Methods of loading and saving datasets in SageMaker
S3 is one of the most popular AWS services for storing data and is integrated with many other AWS services. SageMaker is no exception and has implemented native solutions to load data from and save data to S3.
Normally, you might run two types of algorithms on SageMaker: processing jobs and training jobs. You can use this documentation to see how to load data from S3 into a SageMaker processing job, and this documentation for a training job. The code below demonstrates how to define an input source for a processing job:
from sagemaker.processing import ProcessingInput

ProcessingInput(
    source="s3://sagemaker-sample-data/processing/census/census-income.csv",
    destination="/opt/ml/processing/input"
)
There are two options for loading data from S3 in SageMaker: File mode and Pipe mode. File mode copies the input files from an S3 bucket into the container. This is the default and the most common way of communicating with S3. In the code above, since we did not pass the S3InputMode, File mode is used for loading the data. There is also Pipe mode, which streams the data instead of copying it and, as a result, does not use storage inside the container.
There is one more well-known method of communicating with S3, which we will cover later in this post. But first, let us talk about the destination paths of those inputs inside the containers.
Understanding a container's paths
When we are inside a SageMaker training or processing container, we work with directories under the path
/opt/ml. SageMaker itself also uses this directory as the root directory to save and load all files related to the training/processing algorithm. For instance, hyperparameters are stored as a JSON dictionary under this directory.
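As a small illustration, here is a sketch of reading those hyperparameters from inside a container. The file location follows SageMaker's documented layout (/opt/ml/input/config/hyperparameters.json); the `base` argument is parameterized only so the helper can be tried outside a real container:

```python
import json
from pathlib import Path

def load_hyperparameters(base="/opt/ml"):
    """Read the hyperparameters SageMaker writes into the container.

    `base` defaults to the container root; it is a parameter only so
    this helper can be exercised outside a real SageMaker container.
    """
    path = Path(base) / "input" / "config" / "hyperparameters.json"
    with open(path) as f:
        # SageMaker serializes every hyperparameter value as a string.
        return json.load(f)
```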
There are several states in which you may wish to transfer data from S3 to a container or vice versa. Based on that, we can classify the paths in a container:
- Input Data: The raw data you wish to feed to your preprocessing step must be saved to the input directory of the container.
- Processed Data: The output of your preprocessing step must be saved under the processing directory, in different channels of your choice.
- Trained Model: The artifacts of your trained model from the training step must be saved to the model directory.
- Training Output: Any extra data produced when training a model must be saved to the output directory.
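Putting the list above together, a small helper can sketch this layout. The exact directory names below follow common SageMaker conventions and should be treated as assumptions to verify against the documentation for your job type:

```python
from pathlib import Path

def container_paths(root="/opt/ml"):
    """Conventional SageMaker container layout (assumed, not exhaustive).

    `root` is a parameter only so the mapping can be inspected
    outside a real container.
    """
    root = Path(root)
    return {
        "input_data": root / "processing" / "input",       # raw input for a processing job
        "processed_data": root / "processing" / "output",  # processed data, per channel
        "trained_model": root / "model",                   # model artifacts from training
        "training_output": root / "output",                # extra training output
    }
```

Anything your algorithm writes to the model and output locations is what SageMaker packages and uploads back to S3 when the job finishes.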
Using the S3 SDK, Boto3
Boto3 is the most widely available solution for downloading data from or uploading data to S3, and it is easily accessible inside the containers. I usually won't go with this option, as it is not completely SageMaker-friendly, and there is a better native way to load data from S3 into the container. However, there are some cases in which I prefer to use Boto3 alongside the native SageMaker options, such as loading a big configuration file. Nevertheless, the code below demonstrates how to download or upload data from S3 using Boto3:
import boto3

s3_client = boto3.client('s3')

# Upload the file to S3
s3_client.upload_file('hello.txt', 'MyBucket', 'hello-remote.txt')

# Download the file from S3
s3_client.download_file('MyBucket', 'hello-remote.txt', 'hello2.txt')