5. Transform Dataset
With the understanding gained from previous stages, you should now have an intuition about what data transformations will be needed before running an experiment.
Create and assign a data/transform issue. The assignment of this issue will automatically create an associated branch. Remember to link it to the corresponding milestone.
Note
If your data preparation pipeline requires multiple stages, you will need to open a separate issue for each of them.
The concrete data transforms that you will need to apply depend on the nature of your project. Reference implementations of some commonly used stages are included as part of the template.
Note
When a stage has configuration arguments, it is important to set them in configs/pipeline.yml, as showcased below.
Splits
Property Split
One such commonly used stage consists of dividing a dataset into train/val splits based on properties of the data: property_split.
Assuming you are working on the associated branch, you can add a new stage for the split by editing both configs/pipeline.yml and dvc.yaml:
Adding a property_split stage
```yaml
# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.1.0
  property_split:
    property_name: "file_name"
    pattern_to_match: "'example_regex_pattern'"
```

```yaml
# dvc.yaml
vars:
  - configs/pipeline.yml
stages:
  # . . .
  data-transform-property_split-EXAMPLE_DATASET:
    cmd: >-
      property_split
      --annotations_file results/data/import/EXAMPLE_DATASET.json
      --output_matched_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_val.json
      --property_name ${EXAMPLE_DATASET.property_split.property_name}
      --pattern_to_match ${EXAMPLE_DATASET.property_split.pattern_to_match}
      --output_unmatched_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_train.json
    deps:
      - src/stages/data/transform/property_split.py
      - results/data/import/EXAMPLE_DATASET.json
    outs:
      - results/data/transform/property_split-EXAMPLE_DATASET
```
Note
The pattern_to_match value is enclosed in inner single quotes in addition to the outer double quotes. This is required in order to use special characters commonly found in regexes, such as |. If you remove the inner single quotes, dvc repro will fail.
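Conceptually, the splitting logic is simple: the stage matches a regex against a chosen property of each image and routes matched entries to one output and unmatched entries to the other. A minimal sketch of that idea (the function name and annotation layout are illustrative, not the template's actual implementation):

```python
import re

def property_split(annotations, property_name, pattern_to_match):
    """Split COCO-style image entries by regex-matching one of their properties.

    Returns (matched, unmatched); matched entries typically become the
    validation split, unmatched ones the training split.
    """
    pattern = re.compile(pattern_to_match)
    matched, unmatched = [], []
    for image in annotations["images"]:
        if pattern.search(str(image[property_name])):
            matched.append(image)
        else:
            unmatched.append(image)
    return matched, unmatched

# Example: route every file whose name contains "cat" or "dog" to val.
# Note that the pattern uses |, which is why the quoting above matters.
annotations = {"images": [
    {"file_name": "cat_001.jpg"},
    {"file_name": "dog_002.jpg"},
    {"file_name": "bird_003.jpg"},
]}
val, train = property_split(annotations, "file_name", "cat|dog")
```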
As you may have guessed, in a terminal run:

```shell
dvc repro
```

This will generate the output_matched_file and output_unmatched_file.
You can add, commit and push those changes:
```shell
git add configs/pipeline.yml dvc.yaml dvc.lock results/data/
git commit -m "Applied property_split to COCO_ANIMALS"
git push
dvc push
```
Open a pull request, link it to the issue using keywords, and fill in the corresponding data/transform section.
Note
Remember that when creating a pull request after completing a data/transform stage, it is really useful to state in the pull request conversation how many files and annotations there are per split, so as to share that information with reviewers. There are good practices to build on, keep them in mind :)
Random Split
In some datasets it will be difficult to perform a split based on properties, for example when we have many annotated images without any distinguishing properties. In these situations the only way to split is randomly: random_split.
The process is identical to property_split, but with minor changes to dvc.yaml and configs/pipeline.yml.
Adding a random_split stage
```yaml
# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.1.0
  random_split:
    val_proportion: 0.05
    random_seed: 47
```

```yaml
# dvc.yaml
vars:
  - configs/pipeline.yml
stages:
  # . . .
  data-transform-random_split-EXAMPLE_DATASET:
    cmd: >-
      random_split
      --annotations_file results/data/import/EXAMPLE_DATASET.json
      --output_matched_file results/data/transform/random_split-EXAMPLE_DATASET/EXAMPLE_DATASET_val.json
      --val_proportion ${EXAMPLE_DATASET.random_split.val_proportion}
      --random_seed ${EXAMPLE_DATASET.random_split.random_seed}
      --output_unmatched_file results/data/transform/random_split-EXAMPLE_DATASET/EXAMPLE_DATASET_train.json
    deps:
      - src/stages/data/transform/random_split_[task].py
      - results/data/import/EXAMPLE_DATASET.json
    outs:
      - results/data/transform/random_split-EXAMPLE_DATASET
```
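Conceptually, the stage shuffles the images with a fixed seed and carves off val_proportion of them for validation. A sketch under illustrative names (not the template's actual script); the fixed seed matters because DVC expects a stage to reproduce identical outputs on re-runs:

```python
import random

def random_split(images, val_proportion, random_seed):
    """Randomly split a list of image entries into (val, train).

    Seeding a private Random instance makes the split reproducible
    without touching the global random state.
    """
    rng = random.Random(random_seed)
    shuffled = images[:]  # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_proportion)
    return shuffled[:n_val], shuffled[n_val:]

# With 100 images and val_proportion 0.05, 5 images go to validation.
images = [{"id": i} for i in range(100)]
val, train = random_split(images, val_proportion=0.05, random_seed=47)
```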
Format conversion
In addition to the split stage, for some of the supported_tasks (currently classification and segmentation) you will also need to include stages for converting the splits from the default ai-dataset format (COCO classification, COCO segmentation) to the format required by the training frameworks.
Reference implementations for these stages (classification, segmentation) are also included in the template:
Adding a format conversion stage
```yaml
# dvc.yaml
# . . .
  data-coco_to_mmclassification-EXAMPLE_DATASET:
    foreach:
      - train
      - val
    do:
      cmd: >-
        python src/stages/data/transform/coco_to_mmclassification.py
        --annotations_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
        --output_file results/data/transform/coco_to_mmclassification-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
      deps:
        - src/stages/data/transform/coco_to_mmclassification.py
        - results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
      outs:
        - results/data/transform/coco_to_mmclassification-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
```

```yaml
# dvc.yaml
# . . .
  data-coco_to_mmsegmentation-EXAMPLE_DATASET:
    foreach:
      - train
      - val
    do:
      cmd: >-
        python src/stages/data/transform/coco_to_mmsegmentation.py
        --annotations_file results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
        --output_annotations_file results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
        --output_masks_dir results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/masks/${item}
      deps:
        - src/stages/data/transform/coco_to_mmsegmentation.py
        - results/data/transform/property_split-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.json
      outs:
        - results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/EXAMPLE_DATASET_${item}.txt
        - results/data/transform/coco_to_mmsegmentation-EXAMPLE_DATASET/masks/${item}
```
Note
The format conversion transform is applied separately to the train/val splits using DVC foreach stages.
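The classification conversion essentially joins each image with its annotation and emits one "file_name label" line per image, which is the plain-text annotation layout MMClassification consumes. A hypothetical sketch of that join (function name and annotation layout are illustrative, not the template's actual script):

```python
def coco_to_mmclassification_lines(coco):
    """Turn COCO-style classification annotations into 'file_name label' lines."""
    # Index each image's class label by image id, then join with the images.
    label_by_image = {ann["image_id"]: ann["category_id"]
                      for ann in coco["annotations"]}
    return [f'{img["file_name"]} {label_by_image[img["id"]]}'
            for img in coco["images"]]

coco = {
    "images": [{"id": 1, "file_name": "cat_001.jpg"},
               {"id": 2, "file_name": "dog_002.jpg"}],
    "annotations": [{"image_id": 1, "category_id": 0},
                    {"image_id": 2, "category_id": 1}],
}
lines = coco_to_mmclassification_lines(coco)
```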
After running:

```shell
dvc repro
```

You can add, commit and push those changes:

```shell
git add dvc.yaml dvc.lock results/data/
git commit -m "Applied coco_to_mmsegmentation to COCO_ANIMALS"
git push
```
Open a pull request, link it to the issue using keywords, and fill the corresponding data/transform section.
Exploring the pipeline with dvc dag
As you might have noticed, the contents of dvc.yaml are growing rapidly; DVC includes a tool for easily visualizing the complete pipeline:

```shell
dvc dag
```
Example output:
```
            +------------------------+
            | import-EXAMPLE_DATASET |
            +------------------------+
   +----------------------------------------------+
   | results/data/import/EXAMPLE_DATASET.json.dvc |
   +----------------------------------------------+
                 *****         *****
              ***                   ***
+-------------------------+   +-------------------------------------+
| explore-EXAMPLE_DATASET |   | data-property_split-EXAMPLE_DATASET |
+-------------------------+   +-------------------------------------+
                                   ******         ******
                                ****                    ****
+---------------------------------------------------+   +-------------------------------------------------+
| data-coco_to_mmsegmentation-EXAMPLE_DATASET_train |   | data-coco_to_mmsegmentation-EXAMPLE_DATASET_val |
+---------------------------------------------------+   +-------------------------------------------------+
```
Note
When needed, you can open data/explore issues and include those stages in the pipeline to explore intermediate results of transforms.