5. Transform Dataset

With the understanding gained from previous stages, you should now have an intuition about what data transformations will be needed before running an experiment.

Create and assign a data/transform issue. Assigning this issue will automatically create an associated branch. Remember to link it to the corresponding milestone.

Note

If your data preparation pipeline requires multiple stages, you will need to open separate issues for each one of them.

The concrete data transforms that you will need to apply depend on the nature of your project. Reference implementations of some commonly used stages are included as part of the template.

Note

When a stage has configuration arguments, it is important to set them in configs/pipeline.yml, as showcased below.

Splits

Property Split

One such commonly used stage consists of dividing a dataset into train/val splits based on properties of the data: property_split.

Assuming you are working on the associated branch, you can add a new stage for the split by editing both configs/pipeline.yml and dvc.yaml:

Adding a property_split stage

# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.1.0
  property_split:
    property_name: "file_name"
    pattern_to_match: "'example_regex_pattern'"

# dvc.yaml
vars:
  - configs/pipeline.yml

stages:

  # . . .

  data-transform-property_split-EXAMPLE_DATASET:
    cmd: property_split
      --annotations_file results/data/import/EXAMPLE_DATASET.json
      --output_matched_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_val.json
      --property_name ${EXAMPLE_DATASET.property_split.property_name}
      --pattern_to_match ${EXAMPLE_DATASET.property_split.pattern_to_match}
      --output_unmatched_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_train.json
    deps:
      - src/stages/data/transform/property_split.py
      - results/data/import/EXAMPLE_DATASET.json
    outs:
      - results/data/transform/property_split-EXAMPLE_DATASET

Note

The pattern_to_match value is enclosed in inner single quotes in addition to the outer double quotes. This is required in order to use special characters that are common in regular expressions, such as |. If you remove the inner single quotes, dvc repro will fail.
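
To see why the inner quotes matter, the following minimal sketch tokenizes two command lines roughly the way a POSIX shell would, using Python's standard shlex module. It assumes that the stage cmd ends up being interpreted by a shell; the pattern val_.*|test_.* is made up for illustration and is not part of the template.

```python
import shlex


def shell_tokens(command: str) -> list[str]:
    """Tokenize a command line roughly the way a POSIX shell would."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Hypothetical pattern, used only for illustration.
quoted = shell_tokens("property_split --pattern_to_match 'val_.*|test_.*'")
unquoted = shell_tokens("property_split --pattern_to_match val_.*|test_.*")

print(quoted)    # the whole pattern stays a single argument
print(unquoted)  # the | is split off as a separate (pipe) token
```

With the single quotes the regex reaches the stage as one argument. Without them, the | would be read as a pipe and everything after it as a second command.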

As you may have guessed, the next step is to run the following in a terminal:

dvc repro

This generates the output_matched_file and the output_unmatched_file.
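
If it helps to picture what ends up in each file, here is a minimal sketch of a property-based split over a COCO-style dictionary. It is not the template's property_split implementation: the function name split_by_property, the use of re.search, and the tiny dataset are all assumptions made for illustration.

```python
import re


def split_by_property(coco: dict, property_name: str, pattern: str) -> tuple[dict, dict]:
    """Split a COCO-style dict into (matched, unmatched) by an image property.

    Assumption: the pattern is applied with re.search to the string value of
    each image's property; the template's stage may match differently.
    """
    regex = re.compile(pattern)
    matched_ids = {
        image["id"]
        for image in coco["images"]
        if regex.search(str(image.get(property_name, "")))
    }

    def subset(keep_matched: bool) -> dict:
        # Keep images on one side of the split, plus the annotations that point to them.
        return {
            "images": [
                i for i in coco["images"] if (i["id"] in matched_ids) == keep_matched
            ],
            "annotations": [
                a for a in coco["annotations"]
                if (a["image_id"] in matched_ids) == keep_matched
            ],
            "categories": coco.get("categories", []),
        }

    return subset(True), subset(False)


# Tiny made-up dataset, for illustration only.
example = {
    "images": [
        {"id": 1, "file_name": "val_0001.jpg"},
        {"id": 2, "file_name": "train_0001.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 0},
        {"id": 11, "image_id": 2, "category_id": 0},
        {"id": 12, "image_id": 2, "category_id": 0},
    ],
    "categories": [{"id": 0, "name": "animal"}],
}
val, train = split_by_property(example, "file_name", r"^val_")
```

In this sketch the matched images play the role of the val split and the unmatched ones the train split, mirroring how output_matched_file and output_unmatched_file are named in the stage above.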

You can add, commit and push those changes:

git add configs/pipeline.yml dvc.yaml dvc.lock results/data/
git commit -m "Applied property_split to EXAMPLE_DATASET"
git push
dvc push

Open a pull request, link it to the issue using keywords, and fill the corresponding data/transform section.

Note

Remember that when you open a pull request after completing a data/transform stage, it is really useful to write in the pull request conversation how many files and annotations there are per split, so that reviewers have that information at hand. Good practices like this one are worth building on, so keep it in mind :)
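
One possible way to get those numbers is a small helper like the sketch below. It assumes the split files are COCO-style JSON with top-level images and annotations lists; the names count_split and summarize are hypothetical and not part of the template.

```python
import json
from pathlib import Path


def count_split(coco: dict) -> dict:
    """Count files (images) and annotations in a COCO-style dict."""
    return {
        "files": len(coco.get("images", [])),
        "annotations": len(coco.get("annotations", [])),
    }


def summarize(split_files: dict[str, str]) -> str:
    """Return one line per split, ready to paste into the pull request."""
    lines = []
    for name, path in split_files.items():
        counts = count_split(json.loads(Path(path).read_text()))
        lines.append(
            f"{name}: {counts['files']} files, {counts['annotations']} annotations"
        )
    return "\n".join(lines)


# Example call, using the output paths of the property_split stage:
# print(summarize({
#     "train": "results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_train.json",
#     "val": "results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_val.json",
# }))
```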

Random Split

In some datasets it is difficult to perform a split based on properties, for example when there are many annotated images but none of them carry distinguishing properties. In these situations the only option is to split randomly: random_split.

The process is the same as for property_split, with minor changes to dvc.yaml and configs/pipeline.yml.

Adding a random_split stage

# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.1.0
  random_split:
    val_proportion: 0.05
    random_seed: 47

# dvc.yaml
vars:
  - configs/pipeline.yml

stages:

  # . . .

  data-transform-random_split-EXAMPLE_DATASET:
    cmd: random_split
      --annotations_file results/data/import/EXAMPLE_DATASET.json
      --output_matched_file results/data/transform/random_split-EXAMPLE_DATASET/EXAMPLE_DATASET_val.json
      --val_proportion ${EXAMPLE_DATASET.random_split.val_proportion}
      --random_seed ${EXAMPLE_DATASET.random_split.random_seed}
      --output_unmatched_file results/data/transform/random_split-EXAMPLE_DATASET/EXAMPLE_DATASET_train.json
    deps:
      - src/stages/data/transform/random_split_[task].py
      - results/data/import/EXAMPLE_DATASET.json
    outs:
      - results/data/transform/random_split-EXAMPLE_DATASET
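
The sketch below illustrates how val_proportion and random_seed could interact in a seeded random split. It is not the template's random_split implementation: the function name random_split_ids, the sort-then-shuffle strategy, and the rounding rule are assumptions made for illustration.

```python
import random


def random_split_ids(
    image_ids: list[int], val_proportion: float, random_seed: int
) -> tuple[list[int], list[int]]:
    """Return (val_ids, train_ids) from a seeded shuffle of the image ids."""
    rng = random.Random(random_seed)  # local generator, so global state is untouched
    shuffled = sorted(image_ids)      # sort first so the input order does not matter
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_proportion)
    return shuffled[:n_val], shuffled[n_val:]


# 100 made-up image ids with the values from configs/pipeline.yml above.
val_ids, train_ids = random_split_ids(
    list(range(100)), val_proportion=0.05, random_seed=47
)
print(len(val_ids), len(train_ids))
```

Fixing the seed in configs/pipeline.yml is what lets dvc repro produce the same split on every machine. Changing either value changes the stage parameters, so DVC reruns the stage.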

Format conversion

In addition to the split stage, for some of the supported_tasks (currently classification and segmentation), you will also need to include stages for converting the splits from the default ai-dataset format (COCO classification, COCO segmentation) to the format required by the training frameworks.

Reference implementations for these stages (classification, segmentation) are also included in the template:

Adding a format conversion stage

# dvc.yaml
# . . .

  data-coco_to_mmclassification-EXAMPLE_DATASET:
    foreach:
      - train
      - val
    do:
      cmd: python src/stages/data/transform/coco_to_mmclassification.py
        --annotations_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
        --output_file results/data/transform/coco_to_mmclassification-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
      deps:
        - src/stages/data/transform/coco_to_mmclassification.py
        - results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
      outs:
        - results/data/transform/coco_to_mmclassification-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt

# dvc.yaml
# . . .

  data-coco_to_mmsegmentation-EXAMPLE_DATASET:
    foreach:
      - train
      - val
    do:
      cmd: python src/stages/data/transform/coco_to_mmsegmentation.py
        --annotations_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
        --output_annotations_file results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
        --output_masks_dir results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/masks/${item}
      deps:
        - src/stages/data/transform/coco_to_mmsegmentation.py
        - results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
      outs:
        - results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
        - results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/masks/${item}

Note

The format conversion transform is applied separately to each of the train/val splits using DVC foreach stages.
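
As a rough mental model of what foreach does with ${item}, the sketch below expands the classification command once per split using Python's string.Template. DVC performs this substitution itself; the helper expand_foreach is hypothetical and only mimics the effect.

```python
from string import Template

# Same command as the coco_to_mmclassification stage above, on one line.
cmd_template = Template(
    "python src/stages/data/transform/coco_to_mmclassification.py"
    " --annotations_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json"
    " --output_file results/data/transform/coco_to_mmclassification-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt"
)


def expand_foreach(template: Template, items: list[str]) -> dict[str, str]:
    """Build one concrete command per item, as a foreach stage would."""
    return {item: template.substitute(item=item) for item in items}


commands = expand_foreach(cmd_template, ["train", "val"])
for item, command in commands.items():
    print(f"[{item}] {command}")
```

Each expanded command has its own deps and outs, so changing only the val split does not force the train conversion to rerun.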

After running:

dvc repro

You can add, commit and push those changes:

git add dvc.yaml dvc.lock results/data/
git commit -m "Applied coco_to_mmsegmentation to EXAMPLE_DATASET"
git push

Open a pull request, link it to the issue using keywords, and fill the corresponding data/transform section.


Exploring the pipeline with dvc dag

As you might have noticed, the contents of dvc.yaml are rapidly growing; DVC includes a tool for easily visualizing the complete pipeline:

dvc dag

Example output:

+------------------------+
| import-EXAMPLE_DATASET |
+------------------------+
                    +----------------------------------------------+  
                    | results/data/import/EXAMPLE_DATASET.json.dvc |  
                    +----------------------------------------------+  
                          *****                         *****  
                     *****                                   *****  
                  ***                                             ***  
+-------------------------+                           +-------------------------------------+  
| explore-EXAMPLE_DATASET |                           | data-property_split-EXAMPLE_DATASET |  
+-------------------------+                           +-------------------------------------+  
                                                   *******                                *******  
                                             ******                                              ******  
                                         ****                                                          ****  
        +---------------------------------------------------+                           +-------------------------------------------------+
        | data-coco_to_mmsegmentation-EXAMPLE_DATASET_train |                           | data-coco_to_mmsegmentation-EXAMPLE_DATASET_val |
        +---------------------------------------------------+                           +-------------------------------------------------+
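
The edges in this graph follow from the deps and outs declared in dvc.yaml: a stage depends on another when one of its deps is, or lives inside, one of the other stage's outs. The sketch below rebuilds that ordering for three of the stages with Python's graphlib. It is a simplified illustration, not how DVC is implemented; the stage names are taken from the example output, script deps are omitted, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

SPLIT_DIR = "results/data/transform/property_split-EXAMPLE_DATASET"
SEG_DIR = "results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET"

# Simplified view of dvc.yaml: stage name -> (deps, outs).
stages = {
    "data-coco_to_mmsegmentation-EXAMPLE_DATASET_train": (
        [f"{SPLIT_DIR}/EXAMPLE_DATASET_train.json"],
        [f"{SEG_DIR}/EXAMPLE_DATASET_train.txt", f"{SEG_DIR}/masks/train"],
    ),
    "data-coco_to_mmsegmentation-EXAMPLE_DATASET_val": (
        [f"{SPLIT_DIR}/EXAMPLE_DATASET_val.json"],
        [f"{SEG_DIR}/EXAMPLE_DATASET_val.txt", f"{SEG_DIR}/masks/val"],
    ),
    "data-property_split-EXAMPLE_DATASET": (
        ["results/data/import/EXAMPLE_DATASET.json"],
        [SPLIT_DIR],
    ),
}


def produces(out: str, dep: str) -> bool:
    """True when dep is the output itself or a path inside an output directory."""
    return dep == out or dep.startswith(out.rstrip("/") + "/")


def build_graph(stages: dict) -> dict[str, set[str]]:
    """Map each stage to the set of stages it depends on."""
    graph: dict[str, set[str]] = {name: set() for name in stages}
    for name, (deps, _) in stages.items():
        for other, (_, outs) in stages.items():
            if other != name and any(produces(o, d) for o in outs for d in deps):
                graph[name].add(other)
    return graph


graph = build_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(order)  # the split stage comes before both conversion stages
```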

Note

When needed, you can open data/explore issues and include those stages in the pipeline for exploring intermediate results of transforms.