2. Create/Import/Update Dataset
Once the milestone has been defined, the next step is to incorporate dataset(s) into your project.
A dataset is expected to be an ai-dataset-template repository.
Create
You will need to create a data/create issue if the dataset:
- Is not in this repository format.
- Doesn't exist yet and you will need to capture and annotate it.
- Needs a new version of an existing ai-dataset to be released.
This issue doesn't need to be associated with a branch or pull request; once the ai-dataset repository has been created, you can close it by posting a comment with a link to the released version.
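Dataset versions such as v0.1.0 typically correspond to release tags in the ai-dataset repository. As a minimal sketch (hypothetical, using a throwaway local repo for illustration), publishing such a version could look like this:

```shell
# Hypothetical sketch: publishing a dataset version as an annotated git tag
# in an ai-dataset repository, so it can later be pinned by version.
# A throwaway repo is used here purely for illustration.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "Add annotations"
git tag -a v0.1.0 -m "First release of the dataset"
git tag --list   # -> v0.1.0
```

In a real ai-dataset repository you would also push the tag (e.g. `git push origin v0.1.0`) so that others can reference that revision.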
Import
Once you have an ai-dataset repository, you will need to create a data/import issue and assign it to someone. The assignment of this issue will automatically create an associated branch:
Both the data/create and data/import issues must be linked to their corresponding milestone:
Assuming you are working on the associated branch, you can add a new stage to dvc.yaml that imports the dataset with dvc import, by editing both configs/pipeline.yml and dvc.yaml. Note that this will only work on a machine with access to the DVC cache.
Adding an import stage

configs/pipeline.yml:

```yaml
EXAMPLE_DATASET:
  version: v0.1.0
```

dvc.yaml:

```yaml
vars:
  - configs/pipeline.yml

stages:
  # . . .
  data-import-EXAMPLE_DATASET:
    cmd: >-
      dvc import
      "git@github.com:Gradiant/ai-dataset-EXAMPLE_DATASET.git"
      "annotations/EXAMPLE_DATASET.json"
      --out "results/data/import/EXAMPLE_DATASET.json"
      --rev ${EXAMPLE_DATASET.version}
```
After that, run the pipeline from a terminal:

```shell
dvc repro
```

This will clone the dataset repository and import a new file to the indicated output path (--out). The --rev argument is used to pin the dataset to a specific version (i.e. a release).
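To make the pinning concrete, the sketch below (hypothetical temporary paths; the pipeline.yml contents mirror the example above) simulates with sed what DVC interpolates for ${EXAMPLE_DATASET.version} in the stage command:

```shell
# Sketch: the version declared in configs/pipeline.yml is what DVC substitutes
# for ${EXAMPLE_DATASET.version} in the import command's --rev argument.
# Simulated here with sed instead of a real DVC project.
dir=$(mktemp -d)
cat > "$dir/pipeline.yml" <<'EOF'
EXAMPLE_DATASET:
  version: v0.1.0
EOF
version=$(sed -n 's/^ *version: *//p' "$dir/pipeline.yml")
echo "--rev ${version}"   # -> --rev v0.1.0
```

Bumping the version in configs/pipeline.yml is therefore all that changes which release the stage pins.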
Once the dataset has been imported and pinned to a version, you can add, commit and push those changes:
```shell
git add results/data/import
git commit -m "Imported EXAMPLE_DATASET v0.1.0"
git push
```
Open a pull request, link it to the issue using keywords, and fill in the corresponding data/import section:
Note
When your ai-project requires more than one dataset, you need to repeat this process (open issue -> pull request) for each of them.
Update
If later in the project you create a new version of the dataset, for example by adding more data, you can add a new stage to dvc.yaml that updates the dataset with dvc update, pinning it to the new version:
Adding an update stage

configs/pipeline.yml:

```yaml
EXAMPLE_DATASET:
  version: v0.2.0
```

dvc.yaml:

```yaml
vars:
  - configs/pipeline.yml

stages:
  # . . .
  data-update-EXAMPLE_DATASET:
    cmd: >-
      dvc update
      --rev ${EXAMPLE_DATASET.version}
      "results/data/import/EXAMPLE_DATASET.json.dvc"
```
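What the update stage effectively changes is the revision pinned inside the .dvc pointer file that dvc import created. The sketch below simulates this with sed on a simplified, hypothetical .dvc file (real files produced by DVC contain additional fields):

```shell
# Hypothetical, simplified .dvc pointer file: the update stage re-pins the
# `rev` of the imported dependency. Simulated with sed instead of real DVC.
dir=$(mktemp -d)
cat > "$dir/EXAMPLE_DATASET.json.dvc" <<'EOF'
deps:
- path: annotations/EXAMPLE_DATASET.json
  repo:
    url: git@github.com:Gradiant/ai-dataset-EXAMPLE_DATASET.git
    rev: v0.1.0
EOF
# Effect of running the update stage with --rev v0.2.0:
sed 's/rev: v0.1.0/rev: v0.2.0/' "$dir/EXAMPLE_DATASET.json.dvc" \
    > "$dir/updated.dvc"
grep 'rev:' "$dir/updated.dvc"   # -> rev: v0.2.0
```

This is why the updated .dvc file under results/data/import must be committed: it records which dataset release the project is pinned to.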
Once the dataset has been updated to the new version, you can add, commit and push those changes:
```shell
git add results/data/import
git commit -m "Updated EXAMPLE_DATASET to v0.2.0"
git push
```