
2. Create/Import/Update Dataset

Once the milestone has been defined, the next step is to incorporate the dataset(s) into your project.

A dataset is expected to be a repository that follows the ai-dataset-template.
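For orientation, such a repository might look roughly like this (a hypothetical layout: only the annotations/ path is relied on later in this guide; the remaining contents depend on the template):

ai-dataset-EXAMPLE_DATASET/
  annotations/
    EXAMPLE_DATASET.json    # annotation file, imported in the steps below
  ...                       # data files, typically tracked with DVC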

Create

You will need to create a data/create issue if the dataset:

  • Is not in this repository format.
  • Doesn't exist yet and needs to be captured and annotated.
  • Is an existing ai-dataset for which you want to release a new version.

This particular issue doesn't need to be associated with a branch or pull request: once the ai-dataset repository has been created, you can close the issue by posting a comment with a link to the released version.

Import

Once you have an ai-dataset repository, you will need to create a data/import issue and assign it to someone. Assigning the issue will automatically create an associated branch:

[Screenshot: automatically created data/import branch]

Both the data/create and data/import issues must be linked to their corresponding milestone:

[Screenshot: data/import issue linked to its milestone]

Assuming you are working on the associated branch, you can add a new stage that imports the dataset with dvc import by editing both configs/pipeline.yml and dvc.yaml. Note that this will only work if you are on a machine with access to the DVC cache.

Adding an import stage

# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.1.0

# dvc.yaml
vars:
  - configs/pipeline.yml

stages:

  # . . .

  data-import-EXAMPLE_DATASET:
    cmd: dvc import
        "git@github.com:Gradiant/ai-dataset-EXAMPLE_DATASET.git"
        "annotations/EXAMPLE_DATASET.json"
        --out "results/data/import/EXAMPLE_DATASET.json"
        --rev ${EXAMPLE_DATASET.version}

After that, in a terminal you should run:

dvc repro

This will clone the dataset repository and import the file to the indicated output path (--out). The --rev argument is used to pin the dataset to a specific version (i.e. a release).
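For reference, dvc import also writes a results/data/import/EXAMPLE_DATASET.json.dvc file that records the source repository and the pinned revision. Roughly (a sketch only; the exact fields depend on your DVC version, and the hashes below are placeholders):

frozen: true
deps:
- path: annotations/EXAMPLE_DATASET.json
  repo:
    url: git@github.com:Gradiant/ai-dataset-EXAMPLE_DATASET.git
    rev: v0.1.0
    rev_lock: <commit hash that v0.1.0 points to>
outs:
- md5: <hash of the imported file>
  path: EXAMPLE_DATASET.json

This .dvc file is what the dvc update stage targets in the Update section below.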

Once the dataset has been imported and pinned to a version, you can add, commit and push those changes:

git add results/data/import
git commit -m "Imported EXAMPLE_DATASET v0.1.0"
git push

Open a pull request, link it to the issue using closing keywords (e.g. "Closes #<issue number>"), and fill in the corresponding data/import section:

[Screenshot: data/import pull request]

Note

When your ai-project requires more than one dataset, you need to repeat this process (open issue -> pull request) for each of them. The pipeline configuration grows accordingly, as sketched below.
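For instance, importing a second dataset (OTHER_DATASET and its repository below are hypothetical placeholders) would add one entry to configs/pipeline.yml and one stage to dvc.yaml:

# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.1.0
OTHER_DATASET:
  version: v1.0.0

# dvc.yaml (inside the stages section)
  data-import-OTHER_DATASET:
    cmd: dvc import
        "git@github.com:Gradiant/ai-dataset-OTHER_DATASET.git"
        "annotations/OTHER_DATASET.json"
        --out "results/data/import/OTHER_DATASET.json"
        --rev ${OTHER_DATASET.version}

Each dataset still gets its own data/import issue and pull request.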

Update

If later in the project you create a new version of the dataset, for example by adding more data, you can add a new stage to dvc.yaml that updates the dataset with dvc update, pinning it to the new version:

Adding an update stage

# configs/pipeline.yml
EXAMPLE_DATASET:
  version: v0.2.0

# dvc.yaml
vars:
  - configs/pipeline.yml

stages:

  # . . .

  data-update-EXAMPLE_DATASET:
    cmd: dvc update
        --rev ${EXAMPLE_DATASET.version}
        "results/data/import/EXAMPLE_DATASET.json.dvc"

Once the dataset has been updated to the new version, you can add, commit and push those changes:

git add results/data/import
git commit -m "Updated EXAMPLE_DATASET to v0.2.0"
git push