The issue with machine learning pipelines is that they need to pass state from one step to another. When this works, it’s a beautiful thing to behold. When it doesn’t, well, it’s not pretty, and I think the clip below sums this up pretty well.

Azure ML Pipelines are no stranger to this need for passing data between steps, so you have a variety of options at your disposal. That variety makes it harder to pick the right one, and I’ve often seen people get confused when trying to do just that. So I wrote this article to try and clear up some of that confusion.

My idea was to try out three approaches – Datasets, PipelineData, and OutputFileDatasetConfig. I would use them to pass data between some simple writer/reader steps, and document their pros and cons. You see, for some reason I had been under the impression that the three approaches were mostly similar, allowing you to write and read data, with only minor API differences between them.

I thought it would be hard to recommend one over another, and that there would be no clear “best” approach. That’s not exactly the case, as you’ll see in the rest of this article.

1. Using File and Tabular Datasets as Pipeline Inputs

Datasets are a way to explore, transform, and manage data in Azure Machine Learning.

They work relatively well as pipeline step inputs, and not at all as outputs – that’s what PipelineData and OutputFileDatasetConfig are for. Even as inputs they’re a bit limited – you can’t, for example, update a dataset in one step and then pass the updated dataset reference to another step. Even if both steps use that dataset as an input, they’re bound to the same specific version. If one step updates the dataset, it creates a new version, but the following steps won’t receive that version as their input. They’ll receive the previous one instead, the one they were bound to, because that’s what you must have wanted to do all along.
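To make that version binding concrete, here’s a minimal sketch of fetching a dataset pinned to a specific version versus the latest one – the version number is made up for illustration, and `ws` is the same Workspace object used throughout the article:

from azureml.core import Dataset

# By default you get whatever the latest version is at the time of the call...
latest = Dataset.get_by_name(ws, 'TabularData')  # version='latest'

# ...but a pipeline step bound to a dataset behaves as if you'd pinned a
# specific version, something like this (version number made up):
pinned = Dataset.get_by_name(ws, 'TabularData', version=2)

print(latest.version, pinned.version)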

You can kinda get away with using Datasets as inputs and outputs by writing & reading directly to & from the Dataset store (using Dataset.get_by_name and Dataset.register). However, this approach is just like using global variables in your code, and you generally don’t want to use global variables, because your code will become a giant tangled mess that no one, not even you, will be able to understand six months from now. So you shouldn’t do it.
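For illustration, here’s roughly what that global-variable-style pattern looks like inside a step script – a sketch of what not to do, assuming the same 'TabularData' dataset that gets created further down:

# Inside a step script - the pattern this article advises against.
import pandas as pd
from azureml.core import Dataset, Run

run = Run.get_context()
ws = run.experiment.workspace
datastore = ws.get_default_datastore()

# Read whatever the "current" version happens to be...
df = Dataset.get_by_name(ws, 'TabularData').to_pandas_dataframe()

# ...mutate it, then register the result as a new version for other steps to pick up.
df['Wuzzy'] = 'had no hair'
Dataset.Tabular.register_pandas_dataframe(df, datastore, 'TabularData')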

One other reason I have for keeping my dataset usage to a minimum is that you can’t really test steps that use dataset inputs locally [1]. So if you want to see how your code behaves, you’ll have to run the respective pipeline steps in the cloud, and wait for the compute nodes to be provisioned, and for the data to be copied, and for the code to run, and before you know it it’s lunch time, and oh my what a big lunch you’ve had, and anyway you’ve forgotten what you wanted to try out so better start it again. It’s nice to be able to test stuff locally.

In the code below you’ll see how to send both tabular and file datasets to a script step. While using the tabular dataset is pretty straightforward, the file dataset can either be sent as a direct reference, mounted, or downloaded to the node. If you’re the type of person who cares about performance, you might want to know that as_download performs better than as_mount, as per this Stack Overflow answer.

You’ll also notice my file dataset is created from a series of HTTP addresses. This was the simplest way I could find to create a file dataset, but I’ve found it quite confusing to use, so I don’t really recommend it.

import pandas as pd
from azureml.core import Dataset
from azureml.pipeline.steps import PythonScriptStep

# `ws` and `compute_target` are assumed to be an existing Workspace and compute target.
datasets = Dataset.get_all(ws)
datastore = ws.get_default_datastore()
if 'TabularData' not in datasets:
    Dataset.Tabular.register_pandas_dataframe(pd.DataFrame({'Fuzzy': ['Wuzzy']}), datastore, 'TabularData')

if 'FileData' not in datasets:
    tempFileData = Dataset.File.from_files(
        ['https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg',
         'https://vladiliescu.net/images/3-ways-to-pass-data-between-azure-ml-pipeline-steps.jpg',
         'https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg']
    )
    tempFileData.register(ws, name='FileData', create_new_version=True)

tabularData = Dataset.get_by_name(ws, 'TabularData')
fileData = Dataset.get_by_name(ws, 'FileData')


read_datasets_step = PythonScriptStep(
    name='The Dataset Reader',
    script_name='read-datasets.py',
    inputs=[
        tabularData.as_named_input('Table'),
        fileData.as_named_input('Files'),
        fileData.as_named_input('Files_mount').as_mount(),
        fileData.as_named_input('Files_download').as_download(),
    ],
    compute_target=compute_target,
    source_directory='./dataset-reader',
    allow_reuse=False,
)

And below is how to access those dataset inputs. Again, the tabular dataset is very straightforward to use and a general joy to work with, to_pandas_dataframe and all that.

The file dataset is a bit more complicated though. If you’ve chosen to send it as a reference, then you can go ahead and mount it manually, and then do your thing. If you’re sending it as_download or as_mount, you’ll get a path reference which you can parse and process however you see fit.

There’s a but.

Remember when I said I don’t recommend using file datasets created from web addresses? Sure you do. Here’s why – all the files will be saved in a directory structure that matches their path structure, starting with a directory called ‘https%3A’, as per this other Stack Overflow question & answer. So if you use some long and complex URLs, you’ll have to navigate down each and every one of them to get to the files you want. The more complex your URLs, the more complex your directory structure. All of it starting with ‘https%3A’, which is not fun, not fun at all. I’d rather download and upload them myself to blob storage, thankyouverymuch.

# read-datasets.py
import os

from azureml.core import Run

run = Run.get_context()

tableData = run.input_datasets['Table']
fileData = run.input_datasets['Files']
fileDataMount = run.input_datasets['Files_mount']
fileDataDownload = run.input_datasets['Files_download']

# Read the dataset - easy
print(type(tableData))
print(tableData.to_pandas_dataframe())

# Read the datadir - easy-ish

# This is the part where you would traverse the ['https%3A']/... folder structure to get to the files
print('Mounting the dataset manually')
with fileData.mount() as mount_context:
    # List top level mounted files and folders in the dataset
    print(os.listdir(mount_context.mount_point))

print('Using an `as_mount` reference')
print(os.listdir(fileDataMount))

print('Using an `as_download` reference')
print(os.listdir(fileDataDownload))
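
As for the alternative I mentioned – downloading the files yourself and uploading them to blob storage – here’s a minimal sketch of what that could look like, assuming a local ./images folder and using Dataset.File.upload_directory (the folder and dataset names are made up, treat this as a sketch rather than gospel):

from azureml.core import Dataset
from azureml.data.datapath import DataPath

# `ws` is assumed to be an existing Workspace.
datastore = ws.get_default_datastore()

# Upload a local folder to the default datastore, then build a file dataset from it.
# This way the files end up under 'images/' instead of an 'https%3A' folder tree.
uploaded = Dataset.File.upload_directory(src_dir='./images', target=DataPath(datastore, 'images'))
uploaded.register(ws, name='FileDataFromBlob', create_new_version=True)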

2. Passing Data Between Pipeline Steps with PipelineData

While Datasets are used mainly as inputs, PipelineData represents all kinds of intermediate data in Azure Machine Learning pipelines. It’s my favorite way of passing data between pipeline steps: it’s easy to reason about and, more importantly, easy to test locally. Plus, you can use it to send anything to pipeline steps, including files, directories, pickled models, heck, even smoke signals if you set the right **kwargs. It’s safe to say I like PipelineData.

Its API is simple enough: you just create an instance with a name, and then configure your step to use it as an argument. You also have to tell the pipeline steps whether your data is an input or an output – something that OutputFileDatasetConfig, for example, does away with.

Below is some sample code that shows how to configure two Python script steps that send and receive some data using PipelineData. Note how I’m sending the parameter references both in the arguments list and in the outputs and inputs lists, respectively. That’s because I don’t want to get the friendly ValueError: Input/Output dataset appears in arguments list but is not in the input/output lists error message.

from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

dataset_param = PipelineData('dataset')
datadir_param = PipelineData('datadir', is_directory=True)

write_step = PythonScriptStep(
    name='The Writer',
    script_name='write.py',
    arguments=['--dataset', dataset_param, '--datadir', datadir_param],
    outputs=[dataset_param, datadir_param],
    compute_target=compute_target,
    source_directory='./writer',
    allow_reuse=False, 
)

read_step = PythonScriptStep(
    name='The Reader',
    script_name='read.py',
    arguments=['--dataset', dataset_param, '--datadir', datadir_param],
    inputs=[dataset_param, datadir_param],
    compute_target=compute_target,
    source_directory='./reader',
    allow_reuse=False
)
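
To actually run these two steps, you’d wrap them in a Pipeline and submit it to an Experiment – this part isn’t in the snippets above, and the experiment name is made up:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# The step ordering is inferred from the PipelineData dependency:
# write_step produces dataset_param/datadir_param, read_step consumes them.
pipeline = Pipeline(workspace=ws, steps=[write_step, read_step])
run = Experiment(ws, 'pipelinedata-demo').submit(pipeline)
run.wait_for_completion(show_output=True)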

Writing and reading data is also easy enough, as you can see in the two Python files embedded below – the writer and reader steps.

All you need to do is parse the two arguments, and treat one as a file and the other as a directory and that’s it, and it works in the cloud, and more importantly it works on your machine just in case you want to test your pipeline locally. And you do want to test your pipeline locally, because otherwise you’ll spend minutes [2] waiting for the pipeline to finish every time you want to try something new, and you want to try a lot of new things, because sometimes things just don’t work as you’d expect, and you want them to work, yes you do, and you try and try and try and if you’re lucky enough they may work in the end. But I digress.
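
Just to show what that local testing could look like, here’s a minimal sketch that runs the writer and reader scripts below against a temporary folder – it assumes you run it from the directory containing ./writer and ./reader, and the paths are made up:

# run-locally.py (hypothetical) - smoke-test write.py and read.py without Azure ML.
import subprocess
from pathlib import Path

tmp = Path('./local-run')
tmp.mkdir(parents=True, exist_ok=True)

dataset = str(tmp / 'data.csv')
datadir = str(tmp / 'datadir')

subprocess.run(['python', 'writer/write.py', '--dataset', dataset, '--datadir', datadir], check=True)
subprocess.run(['python', 'reader/read.py', '--dataset', dataset, '--datadir', datadir], check=True)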

Here’s how to write and read single files and directories using PipelineData.

# write.py
import argparse
from pathlib import Path
import pandas as pd
from azureml.core.run import _OfflineRun
from azureml.core import Run, Workspace

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', dest='dataset', required=True)
    parser.add_argument('--datadir', dest='datadir', required=True)

    return parser.parse_args()

args = parse_args()
print(f'Arguments: {args.__dict__}')

# Write the dataset
df = pd.DataFrame({'bear': 'Fuzzy Wuzzy was a bear'.split(' '), 'hair': 'Fuzzy Wuzzy had no hair'.split(' ')})
df.to_csv(args.dataset, index=False)


# Write the datadir
p = Path(args.datadir)

# Make sure the directory exists
p.mkdir(parents=True, exist_ok=True)

for index, word in enumerate('So Fuzzy Wuzzy wasn\'t fuzzy, was he?'.split(' ')):
    with (p / f'{index}.txt').open('w') as f:
        f.write(word)

# read.py
import argparse
from pathlib import Path
import pandas as pd
from azureml.core.run import _OfflineRun
from azureml.core import Run, Workspace

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', dest='dataset', required=True)
    parser.add_argument('--datadir', dest='datadir', required=True)

    return parser.parse_args()

args = parse_args()
print(f'Arguments: {args.__dict__}')

# Read the dataset
df = pd.read_csv(args.dataset)
print(df)

# Read the datadir
p = Path(args.datadir)

for child in p.iterdir(): 
    with child.open('r') as f:
        print(f.read(), ' ')
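
Since I mentioned pickled models earlier, here’s a minimal sketch of sending one through a PipelineData argument as well – the script name, the --model argument, and the dict standing in for a trained model are all made up for illustration:

# write-model.py (hypothetical) - write a pickled "model" to a PipelineData path.
import argparse
import pickle

parser = argparse.ArgumentParser()
parser.add_argument('--model', dest='model', required=True)
args = parser.parse_args()

# Any picklable object works; pretend this dict is a trained model.
model = {'weights': [0.1, 0.2], 'bias': 3.0}
with open(args.model, 'wb') as f:
    pickle.dump(model, f)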

3. Passing Data Between Pipeline Steps with OutputFileDatasetConfig

OutputFileDatasetConfig is another way of sending temporary, intermediate data between pipeline steps. It’s a bit more powerful than PipelineData, and more tightly integrated with datasets, including the ability to register an OutputFileDatasetConfig as a dataset, which is pretty cool in itself, to be honest.
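If you’re curious what that registration looks like, here’s a standalone sketch, separate from the snippet further down – as far as I can tell register_on_complete is the method that does this, and the dataset name is made up:

from azureml.data import OutputFileDatasetConfig

# Register the step's output as a dataset named 'FuzzyOutput' once the producing step completes.
fileConfig = OutputFileDatasetConfig(name='file_dataset_cfg').register_on_complete(name='FuzzyOutput')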

For some reason though, I don’t like it. For one, that name. It looks like an internal class, not something meant to be consumed by the end user, and I wish it got renamed to something clearer and shorter [3]. Like, it’s clear what PipelineData does – judging by its name, it’s some data related to pipelines. It’s not so clear what OutputFileDatasetConfig does, however, not at first or second glance at least.

It’s also not very clear to me when I should use this class over PipelineData. Is it when I need to update a Dataset to a new version in one pipeline step, and then use the updated version in another step? Not sure, really. If you do have an idea, please ping me.

Here’s how you would set up a basic pipeline using this approach. Note that the first step passes the plain OutputFileDatasetConfig reference as an argument, whereas the second step passes fileConfig.as_input() in its arguments list. This is how it avoids having to fill in both the arguments and the inputs/outputs lists, unlike PipelineData.


from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

fileConfig = OutputFileDatasetConfig(name='file_dataset_cfg')

write_output_step = PythonScriptStep(
    name='The Output Writer',
    script_name='write-output.py',
    arguments=['--output-dir', fileConfig],
    compute_target=compute_target,
    source_directory='./output-writer',
    allow_reuse=False,
)

read_output_step = PythonScriptStep(
    name='The Output Reader',
    script_name='read-output.py',
    arguments=['--input-dir', fileConfig.as_input()],
    compute_target=compute_target,
    source_directory='./output-reader',
    allow_reuse=False,
)

Writing and reading the file data is pretty similar to how PipelineData works. I do miss the simplicity of single-file data though.

# write-output.py
import argparse
from pathlib import Path

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-dir', dest='output_dir', required=True)

    return parser.parse_args()

args = parse_args()
print(f'Arguments: {args.__dict__}')

p = Path(args.output_dir)
# First, make sure the directory exists
p.mkdir(parents=True, exist_ok=True)

for index, word in enumerate('So Fuzzy Wuzzy wasn\'t fuzzy, was he?'.split(' ')):
    with (p / f'{index}.txt').open('w') as f:
        f.write(word)

# read-output.py
import argparse
from pathlib import Path

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-dir', dest='input_dir', required=True)

    return parser.parse_args()

args = parse_args()
print(f'Arguments: {args.__dict__}')

# Read the datadir
p = Path(args.input_dir)

for child in p.iterdir():
    with child.open('r') as f:
        print(f.read(), ' ')

Conclusion

I suspect these classes will change somehow in the future; at the moment I feel like there’s one too many 😅. Perhaps by giving PipelineData the ability to register datasets and getting rid of OutputFileDatasetConfig? One can only hope.

Until this happens, I’d rely heavily on PipelineData, use Datasets sparingly, and avoid OutputFileDatasetConfig unless I’ve got a good reason to use it. Hope this helps. See you next time! 👋

If you’ve enjoyed this article, you might want to join my email list below – I’ll let you know as soon as I write something new. Also, if you want to know more about creating Azure ML Pipelines, you might want to read my other article on deploying a machine learning model with Azure ML pipelines. It’s quite popular.

Sharing the Twitter thread is cool, too.


1. While I have a nagging suspicion that you can do some funky stuff with sys.argv[1] and emulate sending a dataset to individual pipeline steps, I just don’t have the energy to test this.

2. If you’re lucky.

3. I dare you to say “output file dataset config” five times fast.