# A Quickstart Tutorial for MLDev

## Contents

- [Installation instructions](#installation-instructions)
- [Tutorial 1. Hello World!](#hello-world)
  - [Create a pipeline](#step-1-create-a-pipeline)
  - [Add a stage](#step-2-add-a-script-to-the-stage-hello_world)
  - [Setup and run](#step-3-setup-and-run-the-experiment)
- [Tutorial 2. Classification pipeline](#simple-classification-task-on-template-default)
  - [Get the template](#step-1-get-the-template)
  - [Prepare the data](#step-2-prepare-the-data)
  - [Two stage pipeline](#step-3-two-stage-pipeline)
  - [Variables and expressions](#step-4-variables-and-expressions)
- [Tutorial 3. Using the Collaboration Tool](#using-the-collaboration-tool)
- [How to get help](#how-to-get-help)

## Installation instructions

Please check the project [README.md](https://gitlab.com/mlrep/mldev) for installation instructions.

## Hello World!

> Example for version v0.3 and above

Let us start with a simple Hello World! example. MLDev uses YAML to describe experiments. We are going to write a simple YAML file that prints ``Hello World!`` and the current time, and then exits.

### Step 1. Create a pipeline

After you install MLDev, create an empty folder ``hello_world`` in your home directory. Create an empty ``experiment.yml`` file in the ``hello_world`` folder.

Each experiment contains at least one pipeline. A pipeline is a sequence of steps called stages. A stage is a unit of computation: it can be repeated as a whole. Stages take inputs from disk and produce outputs, which are in turn persisted to disk or other storage.

In order to run an experiment, we create a pipeline of type ``GenericPipeline`` with a single stage. This basic stage takes no inputs and produces no output files.

```yaml
run_hello_world: !GenericPipeline
  runs:
    - !BasicStage
      name: hello_world
```

The name ``hello_world`` of the stage is used in the log when the stage runs.

### Step 2. Add a script to the stage hello_world

Next we add two ``bash`` commands to the stage's ``script:`` field. Here are the annotated contents of the experiment file.

```yaml
pipeline: !GenericPipeline   # the pipeline sets the sequence of steps
  runs:
    - !BasicStage            # each step is a stage,
                             # which takes inputs from disk and
                             # puts outputs back to disk
      name: hello_world
      script:                # this is our bash script
        - echo Hello World!
        # note that we do not need to escape double quotes
        # if we use a text block like this
        - >
          echo "Time is $(date)"
```

In this example we use the YAML folded block prefix ``>``, which joins all similarly indented lines after it into a single line.

### Step 3. Setup and run the experiment

> applies to v0.3

Let us configure our experiment and set up a virtual environment for it. Generally, MLDev uses Python ``venv`` to manage code dependencies. In this experiment we do not have any Python code yet, so initialization is simple.

Switch to the experiment folder ``hello_world`` and run the ``init`` command.

```shell script
$ mldev init -p venv .
```

At this point you should have a folder ``hello_world`` with a file ``experiment.yml`` and a ``venv`` folder in it.

```shell script
$ ls .
experiment.yml venv
```

In order to run the experiment, use the following command.

```shell script
$ mldev run
```

You may see many lines starting with ``INFO:``; ignore them for now. The results of the stage are the last two lines, which will look something like this:

```shell script
$ mldev run
INFO: ...
...
Hello World!
Time is Tue Apr 10 12:19:30 CET 2021
```

### Hello World completed!

Congrats, you have run your first MLDev pipeline. In this tutorial we have used the following MLDev features:

- experiment setup, pipelines and stages
- configuration management using venv
- running the experiments

## Simple classification task on template-default

> Example for version v0.3 and above

Please install MLDev as described in the previous example, if you have not done so already.
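Before moving on, you can check the folded block (``>``) behavior from Tutorial 1 for yourself. The snippet below uses PyYAML (assumed installed via ``pip install pyyaml``; MLDev itself may parse YAML differently):

```python
import yaml  # PyYAML, assumed installed (pip install pyyaml)

snippet = """
script:
  - echo Hello World!
  - >
    echo
    "Time is $(date)"
"""

data = yaml.safe_load(snippet)
# The '>' folded block joins the indented lines into a single line
print(data["script"][1])  # echo "Time is $(date)"
```

Note that the double quotes inside the folded block needed no escaping, exactly as in the stage script above.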
In this tutorial we will see how MLDev uses templates to pre-configure experiments and manages changes using Git. These MLDev features make starting a new experiment easier.

We consider the ``template-default`` template and create a pipeline of three stages:

- prepare, which initializes the data
- train, which trains the model
- predict, which runs and evaluates the model

### Step 1. Get the template

Getting a template for your experiment is easy. Use the ``mldev init`` command with the ``-t`` parameter, specify the name or URL of the template, and set the destination folder.

```shell script
$ mldev init -t template-default ./template-default
```

This will download the template and configure a new local Git repository for the experiment. If a Git user and email are not configured, MLDev will ask you to provide them.

After the MLDev command completes, you should have the ``template-default`` folder with at least the following files and folders from the template:

```shell script
models src results venv LICENSE README.md requirements.txt
```

Here ``README.md`` describes in detail the classification task we are going to solve. The ``src`` folder contains the source code of the experiment. The ``venv`` folder contains the virtual environment the experiment will run in. It was created and configured for you by MLDev.

### Step 2. Prepare the data

Let us create a new experiment file ``experiment-new.yml`` and put our new pipeline in it. We create a new pipeline named ``run_prepare`` and add a basic stage that takes the source code ``./src`` as an input dependency and writes its results to the ``data`` folder. If no files in the input dependencies have changed since the previous run, MLDev may skip the stage and reuse its outputs.

Here is the full source code for the pipeline. Note that we are using YAML anchors to reuse parts of the pipeline later in this tutorial.
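If YAML anchors are new to you, the following snippet shows the idea: an alias ``*`` simply resolves to the content marked by the matching anchor ``&``. It uses PyYAML (assumed installed) and made-up keys for illustration:

```python
import yaml  # PyYAML, assumed installed

doc = """
prepare: &prepare_stage
  name: prepare
  outputs: &input_data [ ./data/ ]
train:
  name: train
  inputs: *input_data
prepare_again: *prepare_stage
"""

data = yaml.safe_load(doc)
print(data["train"]["inputs"])                    # ['./data/']
print(data["prepare_again"] is data["prepare"])   # True - the alias is the same object
```

So anchoring a stage or its outputs once lets later pipelines refer to exactly the same definition.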
```yaml
run_prepare: !GenericPipeline
  runs:
    # we can use anchors '&' to reuse the stage later
    - !BasicStage &prepare_stage
      name: prepare
      # here we specify input dependencies - files or folders
      inputs: [ ./src/ ]
      # here are outputs - the ./data folder
      # note the anchor
      outputs: &input_data [ ./data/ ]
      # a bash script for the stage - just run src/prepare.py
      script:
        - python3 src/prepare.py
        - echo
```

In order to run the pipeline, switch to the experiment folder and run:

```shell script
$ mldev run -f experiment-new.yml run_prepare
```

After the command completes, check that the ``./data`` folder contains the following files:

```shell script
$ ls ./data
X_dev.pickle X_test.pickle X_train.pickle y_dev.pickle y_test.pickle y_train.pickle
```

If you would like to re-run the stage even if the input dependencies did not change, add the ``--force-run`` option like this:

```shell script
$ mldev run -f experiment-new.yml --force-run run_prepare
```

### Step 3. Two stage pipeline

Now it is time to make a two-stage pipeline. We add a ``train`` stage to the ``experiment-new.yml`` file. Of course, we would like to reuse the stage from our previous pipeline, so we refer to it using the anchor ``*prepare_stage``. We also reuse the outputs of that stage via the ``*input_data`` anchor.
Here is the full definition of the pipeline.

```yaml
run_prepare_train: !GenericPipeline
  runs:
    # We use an anchor to reuse the 'prepare' stage
    - *prepare_stage
    # Add a second stage to the pipeline and set an anchor
    - !BasicStage &train_stage
      name: train
      # We can use the params attribute of the stage to add any needed parameters
      # They can be used in computable expressions
      # See below how to do it
      params:
        num_iters: 10
      # Here we use another anchor to add a data dependency on the previous stage
      inputs: *input_data
      outputs: &model_data [ ./models/default/model.pickle ]
      script:
        # We use a computable expression to get the num_iters parameter
        # This works similarly to bash variables, but uses Python
        # Environment variables are available using a dollar sign $ without braces,
        # i.e. $PATH instead of ${PATH}
        - python3 src/train.py --n ${self.params.num_iters}
        - echo Current PATH= $PATH
        - echo
```

Computable expressions are a major feature of MLDev. These expressions can be used to get a parameter value from the running experiment context (the runtime representation of the dynamic YAML document) and use it in scripts or other places. Expressions are computed at run time, on demand.

In our example, the ``self`` variable inside the expression refers to the currently running stage. If you reuse the expression in another stage, ``self`` will refer to that new stage.

Run this new pipeline with the command:

```shell script
$ mldev run -f experiment-new.yml run_prepare_train
```

Then check that the outputs we set in the ``train`` stage were created successfully.

```shell script
$ ls ./models/default
model.pickle
```

Here ``default`` is the name of the trial (run_name), which is specific to our experiment source code. Run ``python ./src/train.py --help`` for more details.

### Step 4. Variables and expressions

There is no need to create a new pipeline each time, of course. You can add stages to your past pipelines whenever needed.
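To build intuition for how an expression like ``${self.params.num_iters}`` is resolved, here is a toy resolver that replaces ``${...}`` spans by following a dotted path on a context object. This is purely illustrative: the ``substitute`` function is made up here, it handles only dotted paths (not MLDev's richer expressions such as ``${path(self.inputs[0])}``), and it is not MLDev's implementation:

```python
import re
from types import SimpleNamespace

def substitute(text, context):
    """Replace each ${dotted.path} with the value found in context (toy example)."""
    def resolve(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part] if isinstance(value, dict) else getattr(value, part)
        return str(value)
    # Only ${...} spans are expanded; a bare $PATH is left for the shell
    return re.sub(r"\$\{([^}]+)\}", resolve, text)

# 'self' refers to the stage, as in the tutorial above
stage = SimpleNamespace(params=SimpleNamespace(num_iters=10))
print(substitute("python3 src/train.py --n ${self.params.num_iters}", {"self": stage}))
# python3 src/train.py --n 10
print(substitute("echo Current PATH= $PATH", {"self": stage}))  # left unchanged
```

Note how ``$PATH`` passes through untouched, matching the convention above: braces mean a computable expression, a bare dollar sign means a shell environment variable.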
Be aware, though, that modifying a pipeline requires re-running it.

Let us finally build the full pipeline. Add the following to the ``experiment-new.yml`` file.

```yaml
run_predict: !GenericPipeline
  runs:
    # We reuse past stages
    - *prepare_stage
    - *train_stage
    # And add a new stage
    - !BasicStage
      name: predict
      # note the use of an anchor
      inputs: *model_data
      outputs: [ ./results/ ]
      # We can add more environment variables
      # They will be available to the script at 'prepare time'
      # path() is an MLDev function to compose a full path from
      # a relative path or link
      env:
        MLDEV_MODEL_PATH: "${path(self.inputs[0])}"
      script:
        # Variables can be set in MLDev's own config
        # in .mldev/config.yaml in the environ section,
        # for example, PYTHON_INTERPRETER
        - $PYTHON_INTERPRETER src/predict.py
        # Here are two examples:
        # (1) the first line reads the path from the environment at run time
        # (2) the second line reads the value using
        #     a computable expression at the 'prepare-stage' pass
        # Using a multiline block lets us avoid joining commands with semicolons
        - |
          echo From the environment: $MLDEV_MODEL_PATH
          echo From the stage params: ${self.env.MLDEV_MODEL_PATH}
          echo
```

Here is the explanation of the difference between (1) and (2). MLDev runs stages in two passes. During the first, 'prepare-stage' pass, the stages in the pipeline are asked to check their parameters and prepare to be run. This pass occurs before anything gets executed, so in case of an error no damage is done. During the second, 'run-stage' pass, the stages are executed using their validated parameters, in the same sequence as in the 'prepare-stage' pass.

Run the new pipeline as usual:

```shell script
$ mldev run -f experiment-new.yml run_predict
```

In the output, find the following lines:

```shell script
...
INFO:mldev:Loading experiment-new.yml
INFO:mldev:Running run_predict from experiment-new.yml
INFO:mldev:GenericPipeline Preparing: prepare
INFO:mldev:GenericPipeline Preparing: train
INFO:mldev:GenericPipeline Preparing: predict
INFO:mldev:GenericPipeline Running: prepare
INFO:mldev:Unchanged (prepare)
INFO:mldev:GenericPipeline Running: train
INFO:mldev:Unchanged (train)
INFO:mldev:GenericPipeline Running: predict
...
```

The first line identifies the experiment that is being run. The second specifies the pipeline from that experiment. The third to fifth lines inform you about the stages that are prepared to run. Then the stages are run; ``prepare`` and ``train`` are skipped because nothing changed in their input dependencies, while ``predict`` is executed.

After the command completes, check that the output folder contains the results.

```shell script
$ ls ./results/default
dev_report.csv report.csv test_report.csv train_report.csv
```

### Task completed!

That's it! You have completed the second tutorial on MLDev. In this tutorial we used the following MLDev features:

- experiment templates
- creating multi-stage pipelines with inputs and outputs
- stage execution and lifecycle
- computable expressions
- environment variables for scripts

## Using the Collaboration Tool

Please watch the tutorial at [the Collaboration Tool page](https://gitlab.com/mlrep/mldev/-/wikis/mldev-collab-tool#tutorial).

## How to get help

Feel free to ask a question at [t.me/mldev_betatest](https://t.me/mldev_betatest) or submit a question/suggestion [here](https://gitlab.com/mlrep/mldev/-/issues).