# A Quickstart Tutorial for MLDev
## Installation instructions
Please check the project README.md for installation instructions.
## Hello World!
Example for version v0.3 and above
Let us start with a simple Hello World! example.
MLDev uses YAML to describe experiments. We are going to write a simple YAML file that prints out `Hello World!`, the current time, and then exits.
### Step 1. Create a pipeline
After you install MLDev, create an empty folder `hello_world` in your home directory. Create an empty `experiment.yml` file in the `hello_world` folder.
Each experiment contains at least one pipeline. A pipeline is a sequence of steps called stages. A stage is a unit of computation; that is, it can be repeated as a whole. Stages take inputs from disk and produce outputs, which are in turn persisted to disk or other storage.
In order to run an experiment, we create a pipeline of type `GenericPipeline` with a single stage. This basic stage takes no inputs and produces no output files.
```yaml
run_hello_world: !GenericPipeline
  runs:
    - !BasicStage
      name: hello_world
```
The stage's name `hello_world` is used in log messages when the stage runs.
### Step 2. Add a script to the stage `hello_world`
Then we add two `bash` commands to the stage's `script:` field.
Here are the annotated contents of the experiment file.
```yaml
pipeline: !GenericPipeline
  # the pipeline sets the sequence of steps
  runs:
    - !BasicStage
      # each step is a stage
      # which takes inputs from disk and
      # puts outputs back to disk
      name: hello_world
      script:
        # this is our bash script
        - echo Hello World!
        # note that we do not need to escape double quotes
        # if we use a text block like this
        - >
          echo "Time is $(date)"
```
In this example we use the YAML folded block prefix ">", which combines all similarly indented lines after it into a single line.
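As a standalone illustration (plain YAML, nothing MLDev-specific), the two list items below yield the same single-line command:

```yaml
script:
  - echo "Time is $(date)"
  # the folded block '>' joins the indented lines below
  # into the same single-line command
  - >
    echo
    "Time is $(date)"
```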
### Step 3. Set up and run the experiment
applies to v0.3
Let us configure our experiment and set up a virtual environment for our experiment.
Generally, MLDev uses Python `venv` to manage code dependencies. But in this experiment we do not have any Python code yet, so initialization is simple.
Switch to the experiment folder `hello_world` and run the `init` command.
```shell
$ mldev init -p venv .
```
At this time, you should have a folder `hello_world` with a file `experiment.yml` and a `venv` folder in it.
```shell
$ ls .
experiment.yml
venv
```
In order to run the experiment, use the following command.
```shell
$ mldev run
```
You may see many lines starting with `INFO:`; ignore them for now. The results of the stage are the last two lines, which will look something like this:
```shell
$ mldev run
INFO: ...
...
Hello World!
Time is Tue Apr 10 12:19:30 CET 2021
Hello World completed!
```
Congrats, you have run your first MLDev pipeline.
In this tutorial we have also used the following MLDev features:

- experiment setup, pipelines and stages
- configuration management using `venv`
- running the experiments
## Simple classification task on `template-default`
Example for version v0.3 and above
Please install MLDev as described in the previous example, if you have not done so already.
In this tutorial, we will see how MLDev uses templates to pre-configure experiments and manages changes using Git.
These MLDev features make starting a new experiment easier.
We take the `template-default` template and create a pipeline of three stages:

- prepare, which initializes the data
- train, which trains the model
- predict, which runs and evaluates the model
### Step 1. Get the template
Getting a template for your experiment is easy. Use the `mldev init` command with the `-t` parameter, specify the name or URL of the template, and set the destination folder.
```shell
$ mldev init -t template-default ./template-default
```
This will download the template and configure a new local Git repository for the experiment. If Git user and email are not configured, MLDev will ask you to provide them.
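If you prefer to set up Git yourself beforehand, the standard Git commands are (the name and email below are placeholders):

```shell
$ git config --global user.name "Your Name"
$ git config --global user.email "you@example.com"
```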
After the MLDev command completes, you should have the `template-default` folder with at least the following files and folders from the template:
```
models
src
results
venv
LICENSE
README.md
requirements.txt
```
Here `README.md` describes in detail the classification task we are going to solve. The `src` folder contains the source code of the experiment. The `venv` folder contains the virtual environment the experiment will be run in. It was created and configured for you by MLDev.
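If you want to inspect this environment yourself, you can activate it like any standard Python venv. This is optional, since MLDev manages the environment for you; the sketch below assumes a POSIX shell:

```shell
$ source venv/bin/activate
(venv) $ python3 --version
(venv) $ deactivate
```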
### Step 2. Prepare the data
Let us create a new experiment file `experiment-new.yml` and put our new pipeline in it. We create a new pipeline named `run_prepare` and add a basic stage that takes the source code `./src` as an input dependency and outputs results to the `data` folder.
If no files in the input dependencies were updated since the previous run, MLDev may skip the stage and reuse its outputs.
Here is the full source code for the pipeline. Note that we are using YAML anchors to reuse parts of the pipeline later in this tutorial.
```yaml
run_prepare: !GenericPipeline
  runs:
    # we can use anchors '&' to reuse the stage later
    - !BasicStage &prepare_stage
      name: prepare
      # here we specify input dependencies - files or folders
      inputs: [ ./src/ ]
      # here are outputs - the ./data folder
      # note the anchor
      outputs: &input_data [ ./data/ ]
      # a bash script for the stage - just run src/prepare.py
      script:
        - python3 src/prepare.py
        - echo
```
In order to run the pipeline, switch to the experiment folder and run:
```shell
$ mldev run -f experiment-new.yml run_prepare
```
After the command completes, check that the `./data` folder contains the following files:
```shell
$ ls ./data
X_dev.pickle X_test.pickle X_train.pickle y_dev.pickle y_test.pickle y_train.pickle
```
If you would like to re-run the stage even if the input dependencies did not change, add the `--force-run` option like this:

```shell
$ mldev run -f experiment-new.yml --force-run run_prepare
```
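Alternatively, the stage also re-runs on the next `mldev run` once any of its input files actually change. A hypothetical way to see this (the appended comment serves only to modify an input file):

```shell
$ echo "# trivial change" >> src/prepare.py   # modify a file in ./src
$ mldev run -f experiment-new.yml run_prepare # the stage runs again
```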
### Step 3. Two-stage pipeline
Now it is time to make a two-stage pipeline.
We add a `train` stage to the `experiment-new.yml` file. Of course, we would like to reuse the stage from our previous pipeline. Let us refer to it using the `*prepare_stage` anchor. We also reuse the outputs from that stage using the `*input_data` anchor.
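Anchors are plain YAML, not an MLDev extension. A minimal standalone illustration:

```yaml
first: &shared [ ./data/ ]   # '&' attaches an anchor to this value
second: *shared              # '*' refers back to the same anchored value
```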
Here is the full definition of the pipeline:
```yaml
run_prepare_train: !GenericPipeline
  runs:
    # We use an anchor to reuse the 'prepare' stage
    - *prepare_stage
    # Add a second stage to the pipeline and set an anchor
    - !BasicStage &train_stage
      name: train
      # We can use the params attribute of the stage to add any needed parameters
      # They can be used in computable expressions
      # See below how to do it
      params:
        num_iters: 10
      # Here we use another anchor to add a data dependency on the previous stage
      inputs: *input_data
      outputs: &model_data [ ./models/default/model.pickle ]
      script:
        # We use a computable expression to get the num_iters parameter
        # This works similarly to bash variables, but uses Python
        # Environment variables are available using a dollar sign $ without braces,
        # i.e. $PATH instead of ${PATH}
        - python3 src/train.py --n ${self.params.num_iters}
        - echo Current PATH= $PATH
        - echo
```
Computable expressions are a major feature of MLDev. These expressions can be used to get a parameter value from the running experiment context (the runtime representation of the dynamic YAML document) and use it in scripts or other places.
Expressions are computed at run time, on demand. In our example, the `self` variable inside the expression refers to the currently running stage. If you reuse the expression in another stage, `self` will refer to that new stage.
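For example, a stage could echo one of its own attributes through an expression. This is only a sketch based on the syntax shown above; that stage attributes such as `name` are reachable through `self` is an assumption by analogy with `self.params`:

```yaml
- !BasicStage
  name: report
  script:
    # hypothetical: assumes stage attributes like 'name' are
    # reachable through 'self', by analogy with self.params above
    - echo "Running stage ${self.name}"
```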
Run this new pipeline with the command:
```shell
$ mldev run -f experiment-new.yml run_prepare_train
```
Then check that the outputs we set in the `train` stage are created successfully.
```shell
$ ls ./models/default
model.pickle
```
Here `default` is the name of the trial (`run_name`), which is specific to our experiment's source code. Run `python ./src/train.py --help` for more details.
### Step 4. Variables and expressions
There is no need to create a new pipeline each time, of course. You can add stages to your past pipelines whenever needed. Be aware, though, that this will require re-running the modified pipelines.
Let us finally build the full pipeline. Add the following to the `experiment-new.yml` file.
```yaml
run_predict: !GenericPipeline
  runs:
    # We reuse past stages
    - *prepare_stage
    - *train_stage
    # And add a new stage
    - !BasicStage
      name: predict
      # note the use of an anchor
      inputs: *model_data
      outputs: [ ./results/ ]
      # We can add more environment variables
      # They will be available to the script at 'prepare time'
      # path() is an MLDev function to compose a full path from
      # a relative path or link
      env:
        MLDEV_MODEL_PATH: "${path(self.inputs[0])}"
      script:
        # Variables can be set in MLDev's own config
        # in .mldev/config.yaml in the environ section
        # For example, PYTHON_INTERPRETER
        - $PYTHON_INTERPRETER src/predict.py
        # Here are two examples
        # (1) In the first line we read the path from the environment at run-time
        # (2) In the second line we read the value using
        #     a computable expression at the prepare-stage pass
        # We avoid escaping the colon by using a multiline block
        - |
          echo From the environment: $MLDEV_MODEL_PATH
          echo From the stage params: ${self.env.MLDEV_MODEL_PATH}
          echo
```
Here is the explanation of the difference between (1) and (2). MLDev runs stages in two passes. During the first, "prepare-stage" pass, the stages in the pipeline are asked to check their parameters and prepare to be run. This pass occurs before anything gets executed, so in case of an error no damage is done.
During the second, "run-stage" pass, the stages are executed with their validated parameters, in the same order as during the "prepare-stage" pass.
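Conceptually, the two passes can be pictured with the following Python-style sketch. This is illustrative pseudo-code of the behaviour described above, not MLDev's actual implementation:

```python
# Illustrative sketch of the two-pass execution model; not MLDev's real code.
def run_pipeline(stages):
    # "prepare-stage" pass: validate everything before anything is executed,
    # so an error here does no damage
    for stage in stages:
        stage.prepare()   # check parameters, resolve computable expressions

    # "run-stage" pass: execute in the same order with validated parameters
    for stage in stages:
        if stage.inputs_unchanged():
            continue      # reuse outputs from a previous run
        stage.run()
```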
Run the new pipeline as usual:
```shell
$ mldev run -f experiment-new.yml run_predict
```
In the output find the following lines:
```
...
INFO:mldev:Loading experiment-new.yml
INFO:mldev:Running run_predict from experiment-new.yml
INFO:mldev:GenericPipeline Preparing: prepare
INFO:mldev:GenericPipeline Preparing: train
INFO:mldev:GenericPipeline Preparing: predict
INFO:mldev:GenericPipeline Running: prepare
INFO:mldev:Unchanged (prepare)
INFO:mldev:GenericPipeline Running: train
INFO:mldev:Unchanged (train)
INFO:mldev:GenericPipeline Running: predict
...
```
The first line identifies the experiment that is being run. The second specifies the pipeline from the experiment. The third to fifth lines inform you about the stages being prepared to run. Then the stages are run; `prepare` and `train` are skipped because nothing changed in their input dependencies.
After the command completes, check that the output folder contains the results.
```shell
$ ls ./results/default
dev_report.csv report.csv test_report.csv train_report.csv
```
Task completed!
That's it! You have completed the second tutorial on MLDev.
In this tutorial we used the following MLDev features:

- experiment templates
- creating multi-stage pipelines with inputs and outputs
- stage execution and lifecycle
- computable expressions
- environment variables for scripts
## How to get help
Feel free to ask a question on t.me/mldev_betatest or submit a question/suggestion here.