Tutorial

DVC pipelines

You can add a data processing stage to the DVC pipeline by editing both the pipeline file dvc.yaml and the corresponding configuration file configs/pipeline.yml. For each new stage, you need to define a new stage name key under the stages key, with the following sub-keys:

- cmd: the stage command.
- deps: the list of files or directories whose modification makes it necessary to run the command again. By default, dvc repro will only run this command if any of these files has changed.
- outs: the list of command output files to be tracked with DVC.

You can use the parameters defined in the configuration file configs/pipeline.yml by referring to them as ${param_name}, or as ${param_name.sub_param_name} in the case of nested parameters.
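To make the placeholder syntax concrete, here is a minimal Python sketch of how ${param_name} and ${param_name.sub_param_name} can be resolved against a nested configuration. It is only an illustration of the lookup rule, not DVC's own templating code; the helper names lookup and render are hypothetical.

```python
import re
from typing import Any, Mapping

# Matches ${name} or ${name.sub_name}. Illustrative only, not DVC's own parser.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def lookup(params: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk a nested mapping following a dotted key such as 'a.b'."""
    value: Any = params
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError if the parameter is missing
    return value


def render(template: str, params: Mapping[str, Any]) -> str:
    """Replace every ${...} placeholder with the matching parameter value."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(params, m.group(1))), template)


# Same values as in the configs/pipeline.yml examples of this tutorial.
params = {"sampling_period": 10, "merge_annotations": {"name_pattern": '"*.json"'}}
print(render("--sampling_period ${sampling_period}", params))
# → --sampling_period 10
print(render("--name_pattern ${merge_annotations.name_pattern}", params))
# → --name_pattern "*.json"
```

A missing parameter raises KeyError in this sketch, which mirrors the idea that a stage should not run with an undefined value.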

This is an example of what a stage in a DVC pipeline looks like:

Example annotation creation script

configs/pipeline.yml

sampling_period: 10

dvc.yaml

vars:
  - configs/pipeline.yml

stages:

  # . . .

  annotation_creation:
    cmd: python src/stages/annotation_creation.py
        --input_categories_file "annotations/categories.json"
        --sampling_period ${sampling_period}
        --output_annotation_file "annotations/EXAMPLE_DATASET.json"
    deps:
        - "annotations/categories.json"
    outs:
        - "annotations/EXAMPLE_DATASET.json"
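On the Python side, a stage script only has to accept the command-line flags used in cmd, read its deps and write its outs. The following is a minimal sketch of such a script, not the actual src/stages/annotation_creation.py: the flag names come from the example above, while the body is a placeholder and assumes that the categories file holds a JSON list of COCO categories.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the flags used in the cmd key of the example stage."""
    parser = argparse.ArgumentParser(description="Hypothetical annotation creation stage")
    parser.add_argument("--input_categories_file", type=Path, required=True)
    parser.add_argument("--sampling_period", type=int, required=True)
    parser.add_argument("--output_annotation_file", type=Path, required=True)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Assumption: the categories file is a JSON list of COCO category dicts.
    categories = json.loads(args.input_categories_file.read_text(encoding="utf-8"))
    # Placeholder body: a real stage would build images and annotations here.
    dataset = {
        "info": {"sampling_period": args.sampling_period},
        "images": [],
        "annotations": [],
        "categories": categories,
    }
    # Write the file declared under outs so that DVC can track it.
    args.output_annotation_file.parent.mkdir(parents=True, exist_ok=True)
    args.output_annotation_file.write_text(json.dumps(dataset, indent=2), encoding="utf-8")


# In a real stage script, main() would be called under the usual guard:
# if __name__ == "__main__":
#     main()
```

Keeping every input and output path as a command-line flag makes it easy to list exactly the same paths under deps and outs in dvc.yaml.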

Once the pipeline is defined, you can reproduce it by running:

dvc repro
dvc push

To learn more about DVC pipelines, check the DVC documentation.

Available stages

Merge multiple COCO files

In some cases it is necessary to merge the annotations from a set of COCO JSON files, for example when building a new dataset from multiple existing datasets, or when a dataset has been partially annotated by several annotators. In such cases, please use the template code for the merging stage, which is already tested, as follows:

Merge multiple COCO files

configs/pipeline.yml

merge_annotations:
    name_pattern: '"*.json"'

dvc.yaml

vars:
  - configs/pipeline.yml

stages:

  # . . .

  merge_annotations:  
    cmd: python src/stages/merge_multiple_coco_files.py
        --dir_path "relative/path/to/input/coco/files/"
        --output_file "annotations/{{cookiecutter.dataset_slug}}.json"
        --name_pattern ${merge_annotations.name_pattern}
    deps:
        - "relative/path/to/input/coco/files/"
    outs:
        - "annotations/{{cookiecutter.dataset_slug}}.json"

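The default input file name pattern can be modified in the pipeline configuration file (the merge_annotations.name_pattern parameter) so that only the files whose names match the given pattern are merged. To do so, you will need to write a matching pattern expression; the default, "*.json", is written in wildcard style and matches every file with the .json extension.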
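As a rough illustration of what the merging stage has to do, the sketch below filters file names by pattern and merges COCO dictionaries while renumbering ids. It is not the code of src/stages/merge_multiple_coco_files.py. It assumes that the pattern is interpreted as a shell-style wildcard, as the default "*.json" suggests (if the actual script expects regular expressions, the re module would replace fnmatch), and that all input files share the same category ids. The function names are hypothetical.

```python
import fnmatch
from typing import Dict, List


def select_files(file_names: List[str], name_pattern: str = "*.json") -> List[str]:
    """Keep only the file names that match a wildcard pattern, in sorted order."""
    return sorted(n for n in file_names if fnmatch.fnmatchcase(n, name_pattern))


def merge_coco(datasets: List[Dict]) -> Dict:
    """Merge COCO dicts, renumbering image and annotation ids to avoid clashes.

    Assumes every input uses the same category ids; categories are
    de-duplicated by id.
    """
    merged: Dict = {"images": [], "annotations": [], "categories": []}
    seen_categories = set()
    next_image_id, next_ann_id = 1, 1
    for dataset in datasets:
        image_id_map = {}  # old image id -> new image id, per input file
        for image in dataset.get("images", []):
            image_id_map[image["id"]] = next_image_id
            merged["images"].append({**image, "id": next_image_id})
            next_image_id += 1
        for ann in dataset.get("annotations", []):
            merged["annotations"].append(
                {**ann, "id": next_ann_id, "image_id": image_id_map[ann["image_id"]]}
            )
            next_ann_id += 1
        for category in dataset.get("categories", []):
            if category["id"] not in seen_categories:
                seen_categories.add(category["id"])
                merged["categories"].append(category)
    return merged


print(select_files(["b.json", "notes.txt", "a.json"]))
# → ['a.json', 'b.json']
```

Renumbering the ids matters because two annotators usually both start counting from the same value, so the original ids would collide in the merged file.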

Crop images

When training an image classification model to be used jointly with a generic object detector, the classifier is sometimes trained on the cropped bounding boxes generated by the detector. To build the dataset in such cases, it is first necessary to generate the cropped images from the original images and the detection annotations.
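The core of the cropping operation can be sketched in a few lines of Python. This is a simplified illustration and not the template code of the stage: it assumes COCO-style boxes in [x, y, width, height] format and uses a nested list as a stand-in for the image. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

CropBox = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def bbox_to_crop_box(bbox: Sequence[float], image_width: int, image_height: int) -> CropBox:
    """Convert a COCO [x, y, width, height] box to a crop box clamped to the image."""
    x, y, w, h = bbox
    left = max(0, math.floor(x))
    top = max(0, math.floor(y))
    right = min(image_width, math.ceil(x + w))
    bottom = min(image_height, math.ceil(y + h))
    return left, top, right, bottom


def crop(image: List[List[int]], box: CropBox) -> List[List[int]]:
    """Crop a row-major nested-list image (a stand-in for a real image array)."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


# 4x4 toy image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
box = bbox_to_crop_box([1, 1, 2, 2], image_width=4, image_height=4)
print(box)
# → (1, 1, 3, 3)
print(crop(image, box))
# → [[5, 6], [9, 10]]
```

The (left, top, right, bottom) tuple is the same box convention that Pillow's Image.crop accepts, so the conversion helper can be reused with real images. Clamping protects against detections that extend slightly beyond the image border.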

To do so, please use the template code for the image cropping stage, which is already tested, as follows:

[TBD]

From Segmentation to Detection annotations

At some point, we might want to turn our segmentation annotations into detection annotations.

As we know, each class corresponds to a unique index in the segmentation masks. One of the main challenges lies in the fact that the original image can contain several instances of the same class, so a bounding box calculated directly from the segmentation mask would include all of them.

To solve this specific challenge, we need to compute the connected components of each mask and class, and store each component as an independent segmentation annotation (without modifying the segmentation mask). Then we can calculate each bounding box with the certainty that it corresponds to a single class instance.
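The idea can be sketched in plain Python as follows. This is an illustrative sketch and not the code of the stage: it assumes 4-connectivity, a mask given as a nested list of class indices, and COCO-style [x, y, width, height] boxes. The function name component_bboxes is hypothetical.

```python
from collections import deque
from typing import List, Tuple

Mask = List[List[int]]
BBox = Tuple[int, int, int, int]  # COCO style: x, y, width, height


def component_bboxes(mask: Mask, class_index: int) -> List[BBox]:
    """Return one bounding box per 4-connected component of class_index."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    boxes: List[BBox] = []
    for row in range(height):
        for col in range(width):
            if mask[row][col] != class_index or visited[row][col]:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            queue = deque([(row, col)])
            visited[row][col] = True
            min_r = max_r = row
            min_c = max_c = col
            while queue:
                r, c = queue.popleft()
                min_r, max_r = min(min_r, r), max(max_r, r)
                min_c, max_c = min(min_c, c), max(max_c, c)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and not visited[nr][nc] and mask[nr][nc] == class_index):
                        visited[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((min_c, min_r, max_c - min_c + 1, max_r - min_r + 1))
    return boxes


# Toy mask: two separate instances of class 1 and one instance of class 2.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [2, 2, 0, 0, 1],
]
print(component_bboxes(mask, 1))
# → [(0, 0, 2, 2), (4, 1, 1, 3)]
```

Without the split, a single box around every class 1 pixel of this toy mask would be (0, 0, 5, 4) and would cover both instances. On real masks, library routines such as scipy.ndimage.label or OpenCV's connectedComponents perform the same labelling much faster.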

To use this new stage, please check the following template:

Compute connected components

dvc.yaml

vars:
  - configs/pipeline.yml

stages:

  # . . .

  compute_connected_components:
    cmd: python src/stages/compute_connected_components.py
        --ann_file "annotations/{{cookiecutter.dataset_slug}}.json"
        --segmentation_folder "path/to/segmentation/folder/"
    deps:
        - "path/to/segmentation/folder/"
    outs:
        - "annotations/{{cookiecutter.dataset_slug}}.json"

Please note that the segmentation_folder needs to contain both the images and masks folders, with the corresponding images specified in the annotation file.