3. Explore Dataset

Once your dataset has been imported, it is important to gain a better understanding of it. This stage can be useful for detecting potential issues with the data/annotations and for deciding which data/transform stages need to be applied afterwards.

Create and assign a data/explore issue. Assigning this issue will automatically create an associated branch. Remember to link it to the corresponding milestone.

Reference exploration stages for each of the supported tasks are included as part of the template. You can extend or modify the Python script to suit your exploration needs (e.g. if your dataset contains some custom attributes).
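As an illustration of such an extension, the following minimal sketch counts the values of a custom attribute across annotations. It assumes a COCO-style ground truth file with a top-level "annotations" list whose items may carry an "attributes" dictionary; that layout, the attribute name "occluded", and the function names are hypothetical and not taken from the template, so adapt them to your actual file format.

```python
import json
from collections import Counter
from pathlib import Path


def load_ground_truth(path: str) -> dict:
    """Read a ground truth JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_attribute_values(ground_truth: dict, attribute: str) -> Counter:
    """Count how often each value of a custom attribute appears.

    Assumption (hypothetical layout): a top-level "annotations" list whose
    items may hold an "attributes" dictionary.
    """
    counts = Counter()
    for annotation in ground_truth.get("annotations", []):
        # Treat a missing or null "attributes" entry as an empty dictionary.
        attributes = annotation.get("attributes") or {}
        # Annotations lacking the attribute are grouped under "<missing>".
        counts[str(attributes.get(attribute, "<missing>"))] += 1
    return counts


if __name__ == "__main__":
    # Small in-memory example standing in for an imported dataset.
    example = {
        "annotations": [
            {"id": 1, "attributes": {"occluded": True}},
            {"id": 2, "attributes": {"occluded": False}},
            {"id": 3},
        ]
    }
    for value, count in count_attribute_values(example, "occluded").items():
        print(f"{value}: {count}")
```

A summary like this can reveal annotations where the attribute is missing or unexpectedly skewed before any data/transform stage is defined.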

Assuming you are working on the associated branch, you can add a new stage to dvc.yaml for exploring each dataset, replacing the placeholder [task] with one of the supported task names:

Adding an explore stage

vars:
  - configs/pipeline.yml

stages:

  # ...

  data-explore-EXAMPLE_DATASET:
    cmd: explore_ground_truth
      --ground_truth_file results/data/import/EXAMPLE_DATASET.json
      --output_folder results/data/explore/EXAMPLE_DATASET
    deps:
      - src/stages/data/explore/explore_ground_truth_[task].py
      - results/data/import/EXAMPLE_DATASET.json
    outs:
      - results/data/explore/EXAMPLE_DATASET

Note

When adding a stage that uses a Python script, it is important to include the actual code as part of the dependencies (deps). This ensures that if you change the code inside the script, the stage will be re-run by dvc repro.
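The reason this works is that DVC records a content hash for each dependency in dvc.lock and compares it on the next run. The sketch below only illustrates that idea in plain Python; it is not DVC's actual implementation, and the function names and the script file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def dependency_changed(path: Path, recorded_hash: str) -> bool:
    """Return True when the file content no longer matches the recorded hash."""
    return file_md5(path) != recorded_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "explore_example.py"  # hypothetical script name
        script.write_text("print('v1')\n")
        # Stands in for the hash that would be kept in dvc.lock.
        recorded = file_md5(script)
        print(dependency_changed(script, recorded))  # unchanged script

        script.write_text("print('v2')\n")
        print(dependency_changed(script, recorded))  # edited script
```

If the script were not listed under deps, no hash would be tracked for it, so edits to the code would go unnoticed and the stage would be treated as up to date.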

After adding the stage, run the following in a terminal:

dvc repro

This will generate the plotting results in the indicated output_folder. Once you have gained a better understanding of the dataset, you can add, commit and push those changes:

git add dvc.yaml dvc.lock results/data/
git commit -m "Explored EXAMPLE_DATASET"
git push
dvc push

Open a pull request, link it to the issue using keywords, and fill in the corresponding data/explore section.

Note

When you open a pull request after completing a data/explore stage, it is really useful to upload some screenshots of the exploration results to the pull request conversation, so that reviewers can share the understanding you gained about the dataset. This is a good practice to build on, so keep it in mind :)