# A Quickstart Tutorial for MLDev

## Contents

- [Installation instructions](#installation-instructions)
- [Tutorial 1. Hello World!](#hello-world)
  - [Create a pipeline](#step-1-create-a-pipeline)
  - [Add a stage](#step-2-add-a-script-to-the-stage-hello_world)
  - [Setup and run](#step-3-setup-and-run-the-experiment)
- [Tutorial 2. Classification pipeline](#simple-classification-task-on-template-default)
  - [Get the template](#step-1-get-the-template)
  - [Prepare the data](#step-2-prepare-the-data)
  - [Two stage pipeline](#step-3-two-stage-pipeline)
  - [Variables and expressions](#step-4-variables-and-expressions)
- [Tutorial 3. Using the Collaboration Tool](#using-the-collaboration-tool)
- [How to get help](#how-to-get-help)

## Installation instructions

Please check the project [README.md](https://gitlab.com/mlrep/mldev) for installation instructions.

## Hello World!

> Example for version v0.3 and above

Let us start with a simple Hello World! example. MLDev uses YAML to describe experiments. We are going to write a simple YAML file that prints ``Hello World!`` and the current time, and then exits.

### Step 1. Create a pipeline

After you install MLDev, create an empty folder ``hello_world`` in your home directory. Create an empty ``experiment.yml`` file in the ``hello_world`` folder.

Each experiment contains at least one pipeline. A pipeline is a sequence of steps called stages. A stage is a unit of computation: it can be repeated as a whole. Stages take inputs from disk and produce outputs, which are in turn persisted to disk or other storage.

In order to run an experiment, we create a pipeline of type ``GenericPipeline`` with a single stage. This basic stage takes no inputs and produces no output files.

```yaml
run_hello_world: !GenericPipeline
  runs:
    - !BasicStage
      name: hello_world
```

The name ``hello_world`` of the stage is used in the log when the stage runs.

### Step 2. Add a script to the stage hello_world

Next we add two ``bash`` commands to the stage's ``script:`` field. Here are the annotated contents of the experiment file.

```yaml
pipeline: !GenericPipeline   # the pipeline sets the sequence of steps
  runs:
    - !BasicStage            # each step is a stage,
                             # which takes inputs from disk and
                             # puts outputs back to disk
      name: hello_world
      script:                # this is our bash script
        - echo Hello World!
        # note that we do not need to escape double quotes
        # if we use a text block like this
        - >
          echo "Time is $(date)"
```

In this example we use the YAML folded block prefix ``>``, which joins all similarly indented lines after it into a single line.

### Step 3. Setup and run the experiment

> applies to v0.3

Let us configure our experiment and set up a virtual environment for it. Generally, MLDev uses Python ``venv`` to manage code dependencies. In this experiment we do not have any Python code yet, so initialization is simple.

Switch to the experiment folder ``hello_world`` and run the ``init`` command.

```shell script
$ mldev init -p venv .
```

At this point you should have a folder ``hello_world`` with a file ``experiment.yml`` and a ``venv`` folder in it.

```shell script
$ ls .
experiment.yml venv
```

In order to run the experiment, use the following command.

```shell script
$ mldev run
```

You may see many lines starting with ``INFO:``; ignore them for now. The results of the stage are the last two lines, which will look something like this:

```shell script
$ mldev run
INFO: ...
...
Hello World!
Time is Tue Apr 10 12:19:30 CET 2021
```

### Hello World completed!

Congrats, you have run your first MLDev pipeline. In this tutorial we have used the following MLDev features:

- experiment setup, pipelines and stages
- configuration management using venv
- running the experiments

## Simple classification task on template-default

> Example for version v0.3 and above

Please install MLDev as described in the previous example, if you have not done so already.
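Before moving on, you can check the folded block (``>``) behavior from Tutorial 1 for yourself. The snippet below uses PyYAML (assumed installed via ``pip install pyyaml``; MLDev itself may parse YAML differently):

```python
import yaml  # PyYAML, assumed installed (pip install pyyaml)

snippet = """
script:
  - echo Hello World!
  - >
    echo
    "Time is $(date)"
"""

data = yaml.safe_load(snippet)
# The '>' folded block joins the indented lines into a single line
print(data["script"][1])  # echo "Time is $(date)"
```

Note that the double quotes inside the folded block needed no escaping, exactly as in the stage script above.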
In this tutorial we will see how MLDev uses templates to pre-configure experiments and manages changes using Git. These MLDev features make starting a new experiment easier.

We consider the ``template-default`` template and create a pipeline of three stages:

- prepare, which initializes the data
- train, which trains the model
- predict, which runs and evaluates the model

### Step 1. Get the template

Getting a template for your experiment is easy. Use the ``mldev init`` command with the ``-t`` parameter, specify the name or URL of the template, and set the destination folder.

```shell script
$ mldev init -t template-default ./template-default
```

This will download the template and configure a new local Git repository for the experiment. If a Git user and email are not configured, MLDev will ask you to provide them.

After the MLDev command completes, you should have the ``template-default`` folder with at least the following files and folders from the template:

```shell script
models src results venv LICENSE README.md requirements.txt
```

Here ``README.md`` describes in detail the classification task we are going to solve. The ``src`` folder contains the source code of the experiment. The ``venv`` folder contains the virtual environment the experiment will run in. It was created and configured for you by MLDev.

### Step 2. Prepare the data

Let us create a new experiment file ``experiment-new.yml`` and put our new pipeline in it. We create a new pipeline named ``run_prepare`` and add a basic stage that takes the source code ``./src`` as an input dependency and writes its results to the ``data`` folder. If no files in the input dependencies have changed since the previous run, MLDev may skip the stage and reuse its outputs.

Here is the full source code for the pipeline. Note that we are using YAML anchors to reuse parts of the pipeline later in this tutorial.
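If YAML anchors are new to you, the following snippet shows the idea: an alias ``*`` simply resolves to the content marked by the matching anchor ``&``. It uses PyYAML (assumed installed) and made-up keys for illustration:

```python
import yaml  # PyYAML, assumed installed

doc = """
prepare: &prepare_stage
  name: prepare
  outputs: &input_data [ ./data/ ]
train:
  name: train
  inputs: *input_data
prepare_again: *prepare_stage
"""

data = yaml.safe_load(doc)
print(data["train"]["inputs"])                    # ['./data/']
print(data["prepare_again"] is data["prepare"])   # True - the alias is the same object
```

So anchoring a stage or its outputs once lets later pipelines refer to exactly the same definition.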
```yaml
run_prepare: !GenericPipeline
  runs:
    # we can use anchors '&' to reuse the stage later
    - !BasicStage &prepare_stage
      name: prepare
      # here we specify input dependencies - files or folders
      inputs: [ ./src/ ]
      # here are outputs - the ./data folder
      # note the anchor
      outputs: &input_data [ ./data/ ]
      # a bash script for the stage - just run src/prepare.py
      script:
        - python3 src/prepare.py
        - echo
```

In order to run the pipeline, switch to the experiment folder and run:

```shell script
$ mldev run -f experiment-new.yml run_prepare
```

After the command completes, check that the ``./data`` folder contains the following files:

```shell script
$ ls ./data
X_dev.pickle X_test.pickle X_train.pickle y_dev.pickle y_test.pickle y_train.pickle
```

If you would like to re-run the stage even if the input dependencies did not change, add the ``--force-run`` option like this:

```shell script
$ mldev run -f experiment-new.yml --force-run run_prepare
```

### Step 3. Two stage pipeline

Now it is time to make a two-stage pipeline. We add a ``train`` stage to the ``experiment-new.yml`` file. Of course, we would like to reuse the stage from our previous pipeline, so we refer to it using the anchor ``*prepare_stage``. We also reuse the outputs of that stage via the ``*input_data`` anchor.
Here is the full definition of the pipeline.

```yaml
run_prepare_train: !GenericPipeline
  runs:
    # We use an anchor to reuse the 'prepare' stage
    - *prepare_stage
    # Add a second stage to the pipeline and set an anchor
    - !BasicStage &train_stage
      name: train
      # We can use the params attribute of the stage to add any needed parameters
      # They can be used in computable expressions
      # See below how to do it
      params:
        num_iters: 10
      # Here we use another anchor to add a data dependency on the previous stage
      inputs: *input_data
      outputs: &model_data [ ./models/default/model.pickle ]
      script:
        # We use a computable expression to get the num_iters parameter
        # This works similarly to bash variables, but uses Python
        # Environment variables are available using a dollar sign $ without braces,
        # i.e. $PATH instead of ${PATH}
        - python3 src/train.py --n ${self.params.num_iters}
        - echo Current PATH= $PATH
        - echo
```

Computable expressions are a major feature of MLDev. These expressions can be used to get a parameter value from the running experiment context (the runtime representation of the dynamic YAML document) and use it in scripts or other places. Expressions are computed at run time, on demand.

In our example, the ``self`` variable inside the expression refers to the currently running stage. If you reuse the expression in another stage, ``self`` will refer to that new stage.

Run this new pipeline with the command:

```shell script
$ mldev run -f experiment-new.yml run_prepare_train
```

Then check that the outputs we set in the ``train`` stage were created successfully.

```shell script
$ ls ./models/default
model.pickle
```

Here ``default`` is the name of the trial (run_name), which is specific to our experiment source code. Run ``python ./src/train.py --help`` for more details.

### Step 4. Variables and expressions

There is no need to create a new pipeline each time, of course. You can add stages to your past pipelines whenever needed.
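To build intuition for how an expression like ``${self.params.num_iters}`` is resolved, here is a toy resolver that replaces ``${...}`` spans by following a dotted path on a context object. This is purely illustrative: the ``substitute`` function is made up here, it handles only dotted paths (not MLDev's richer expressions such as ``${path(self.inputs[0])}``), and it is not MLDev's implementation:

```python
import re
from types import SimpleNamespace

def substitute(text, context):
    """Replace each ${dotted.path} with the value found in context (toy example)."""
    def resolve(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part] if isinstance(value, dict) else getattr(value, part)
        return str(value)
    # Only ${...} spans are expanded; a bare $PATH is left for the shell
    return re.sub(r"\$\{([^}]+)\}", resolve, text)

# 'self' refers to the stage, as in the tutorial above
stage = SimpleNamespace(params=SimpleNamespace(num_iters=10))
print(substitute("python3 src/train.py --n ${self.params.num_iters}", {"self": stage}))
# python3 src/train.py --n 10
print(substitute("echo Current PATH= $PATH", {"self": stage}))  # left unchanged
```

Note how ``$PATH`` passes through untouched, matching the convention above: braces mean a computable expression, a bare dollar sign means a shell environment variable.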
Be aware, though, that modifying a pipeline requires re-running it.

Let us finally build the full pipeline. Add the following to the ``experiment-new.yml`` file.

```yaml
run_predict: !GenericPipeline
  runs:
    # We reuse past stages
    - *prepare_stage
    - *train_stage
    # And add a new stage
    - !BasicStage
      name: predict
      # note the use of an anchor
      inputs: *model_data
      outputs: [ ./results/ ]
      # We can add more environment variables
      # They will be available to the script at 'prepare time'
      # path() is an MLDev function to compose a full path from
      # a relative path or link
      env:
        MLDEV_MODEL_PATH: "${path(self.inputs[0])}"
      script:
        # Variables can be set in MLDev's own config
        # in .mldev/config.yaml in the environ section,
        # for example, PYTHON_INTERPRETER
        - $PYTHON_INTERPRETER src/predict.py
        # Here are two examples:
        # (1) the first line reads the path from the environment at run time
        # (2) the second line reads the value using
        #     a computable expression at the 'prepare-stage' pass
        # Using a multiline block lets us avoid joining commands with semicolons
        - |
          echo From the environment: $MLDEV_MODEL_PATH
          echo From the stage params: ${self.env.MLDEV_MODEL_PATH}
          echo
```

Here is the explanation of the difference between (1) and (2). MLDev runs stages in two passes. During the first, 'prepare-stage' pass, the stages in the pipeline are asked to check their parameters and prepare to be run. This pass occurs before anything gets executed, so in case of an error no damage is done. During the second, 'run-stage' pass, the stages are executed using their validated parameters, in the same sequence as in the 'prepare-stage' pass.

Run the new pipeline as usual:

```shell script
$ mldev run -f experiment-new.yml run_predict
```

In the output, find the following lines:

```shell script
...
INFO:mldev:Loading experiment-new.yml
INFO:mldev:Running run_predict from experiment-new.yml
INFO:mldev:GenericPipeline Preparing: prepare
INFO:mldev:GenericPipeline Preparing: train
INFO:mldev:GenericPipeline Preparing: predict
INFO:mldev:GenericPipeline Running: prepare
INFO:mldev:Unchanged (prepare)
INFO:mldev:GenericPipeline Running: train
INFO:mldev:Unchanged (train)
INFO:mldev:GenericPipeline Running: predict
...
```

The first line identifies the experiment that is being run. The second specifies the pipeline from that experiment. The third to fifth lines inform you about the stages that are prepared to run. Then the stages are run; ``prepare`` and ``train`` are skipped because nothing changed in their input dependencies, while ``predict`` is executed.

After the command completes, check that the output folder contains the results.

```shell script
$ ls ./results/default
dev_report.csv report.csv test_report.csv train_report.csv
```

### Task completed!

That's it! You have completed the second tutorial on MLDev. In this tutorial we used the following MLDev features:

- experiment templates
- creating multi-stage pipelines with inputs and outputs
- stage execution and lifecycle
- computable expressions
- environment variables for scripts

## Using the Collaboration Tool

Please watch the tutorial at [the Collaboration Tool page](https://gitlab.com/mlrep/mldev/-/wikis/mldev-collab-tool#tutorial).

## How to get help

Feel free to ask a question at [t.me/mldev_betatest](https://t.me/mldev_betatest) or submit a question/suggestion [here](https://gitlab.com/mlrep/mldev/-/issues).