Basic Usage
Understanding DVC
DVC stands for Data Version Control. Why do we use it? First, watch this video.

Now, spend some time thinking about it. We need to be able to track our data: if we have 100 annotated images in January, add some more in March, and even more in July, we must be able to access each of those states of the data through time, just as you navigate through commits in time with git. As explained in the video, we cannot upload big files to GitHub, so what DVC basically does is create references to data files (such as images, videos, or large annotation files), called metafiles. Unlike the data files themselves, these metafiles can be committed and pushed to the git repository. So we save the history of these metafiles with git, and DVC is in charge of interpreting them. In addition, in the same way that we clone code from a repository with git, DVC can fetch the referenced data from wherever it is stored. In particular, the ai-database-template configures DVC to look for data in the MM Network Attached Storage (NAS), which acts as the data remote.
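For instance, to recover the dataset exactly as it was at some past point, you check out the corresponding commit with git and then let DVC synchronize the data to match the checked-out metafiles (the tag name below is illustrative):

```sh
# Move the repository (including the DVC metafiles) to a past state
git checkout v0.1.0   # illustrative tag or commit hash

# Make the working copy of the data match the checked-out metafiles
dvc checkout
```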
First steps
Typically, in an MM AI project, you will have to deal with a multimedia dataset: a set of multimedia files along with a set of JSON annotation files containing ground-truth metadata. Both the multimedia files and the annotation metadata files of large datasets are usually too large to be tracked with git. Therefore, the first step is to place your data in the remote storage location, which by default is the MM NAS, mounted at /media on all MM Servers. DVC will create reference metafiles for the selected data files, and git will track their changes. For now, we only track annotation files: tracking multimedia files, such as images, requires large amounts of storage space that are currently not available at Gradiant.
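If you want to double-check where DVC will look for the data, you can list the configured remotes with `dvc remote list`.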
Once you have completed the setup steps, add the first version of the annotations file to the repository. There are two options for this:
A. Add an existing annotation file
If you already have an annotation file in COCO format, you just have to copy it manually from the remote storage location to `annotations/{{cookiecutter.dataset_slug}}.json`.
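For orientation, a COCO-style annotation file is a single JSON object with `images`, `annotations`, and `categories` arrays. A minimal, purely illustrative sketch for object detection could look like this (the exact schema depends on the task, so check the included examples):

```json
{
  "images": [{"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}],
  "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 50]}],
  "categories": [{"id": 1, "name": "cat"}]
}
```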
After that, use DVC to track the annotations file:
dvc add "annotations/{{cookiecutter.dataset_slug}}.json"
This will:

- add the `.json` file to the DVC cache,
- create the corresponding `.json.dvc` metafile with the link to the data in the DVC cache,
- create a `.gitignore` with a list of files not to be tracked by git, and add the annotation `.json` file to it.
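The generated metafile is a small YAML file; its content will be along these lines (the hash and size below are made up for illustration):

```sh
cat "annotations/{{cookiecutter.dataset_slug}}.json.dvc"
# outs:
# - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # illustrative hash
#   size: 123456                            # illustrative size in bytes
#   path: {{cookiecutter.dataset_slug}}.json
```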
B. Create a json annotation file in COCO format
In some cases you might need to do some coding to generate your JSON annotation file in COCO format, for example when:

- the source annotation data is not in COCO format,
- annotation data from several files needs to be merged, or
- there is no annotation data at all and you need to create it from the metadata in the source multimedia files.

You might also need to do some preprocessing of the source multimedia data, such as:

- image format conversion,
- image cropping,
- extraction of frames from videos,
- image subsampling, or
- downloading the data to the MM NAS from some other remote source.
In such cases, it is strongly recommended to use a DVC pipeline to keep track of all the steps (stages) performed for multimedia data preprocessing and annotation file creation, together with the corresponding input parameters. This way you know precisely how the dataset was created and can re-create it when needed (for example, to reconstruct the dataset after data loss, or to reduce data storage use).
To learn about DVC pipelines and see the pre-defined stages available in this template, check the corresponding tutorial. In short, you will need to add the different processing stages to the `dvc.yaml` file of the DVC pipeline and then run it with:
dvc repro
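As an illustration, a stage can also be appended to `dvc.yaml` from the command line with `dvc stage add`; the stage name, script and parameter below are hypothetical:

```sh
dvc stage add -n create_annotations \
    -d src/create_annotations.py \
    -p configs/pipeline.yml:categories \
    -o "annotations/{{cookiecutter.dataset_slug}}.json" \
    python src/create_annotations.py
```

`dvc repro` will then re-run every stage whose dependencies or parameters have changed.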
When running a DVC pipeline, it is not necessary to manually add the annotation file to the DVC cache; DVC will do it automatically. Each output file listed in the `outs` key of any stage of the DVC pipeline will be stored in the DVC cache and added to the `.gitignore` file so that it is not tracked by git. A `dvc.lock` file will also be created, which records the runs of the DVC pipeline together with the corresponding configuration parameters. Both `dvc.yaml` and `dvc.lock` must be tracked with git:
git add "dvc.yaml" "dvc.lock"
Note
Please remember that tests need to be added for all of the code in the repository, including annotation creation and data preprocessing scripts.
Annotation format testing
Make sure that the annotation file is formatted according to the COCO schemas; check the included examples for each supported task. Then you need to test its format. To do so, modify the variable `CATEGORY_NAMES` in the `annotations/SCHEMA.py` file with the list of category names used in your annotation file, and then execute:
. activate {{cookiecutter.dataset_slug}}
pytest -v tests/test_format.py
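For instance, for a dataset with the two illustrative classes cat and dog, you would set `CATEGORY_NAMES = ["cat", "dog"]` in `annotations/SCHEMA.py`.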
Release a new dataset version
You must now update the git repository with your changes:
git add "annotations/.gitignore" "annotations/{{cookiecutter.dataset_slug}}.json.dvc"
git commit -m "Add {{cookiecutter.dataset_slug}}"
git push
dvc push
Once your changes are pushed, you can release your first version of the dataset, making it usable by `ai-project-template` repositories.
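How a release is cut depends on your project's conventions; if versions are marked with git tags, a minimal sketch (with an illustrative version number) would be:

```sh
git tag -a v1.0.0 -m "First release of {{cookiecutter.dataset_slug}}"
git push origin v1.0.0
```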
Updating the annotations
Manual update
In order to update the annotations file (e.g. when new data is added, existing annotations are fixed, or the schema is extended), you first need to unprotect the file (DVC makes tracked files read-only in our cache configuration):
dvc unprotect "annotations/{{cookiecutter.dataset_slug}}.json"
Then you manually replace your annotation file and, if needed, manually update the corresponding COCO schema.
Finally, you will need to repeat the steps of adding the updated annotations to DVC and git:
dvc add "annotations/{{cookiecutter.dataset_slug}}.json"
git add "annotations/{{cookiecutter.dataset_slug}}.json.dvc"
git commit -m "Update {{cookiecutter.dataset_slug}}"
git push
dvc push
Update using DVC pipeline
In this case, you first update your data preprocessing and annotation creation scripts, the DVC pipeline (`dvc.yaml` and the corresponding config file), and the COCO schema as needed. Then you just run your DVC pipeline again:
dvc repro
This will automatically update your annotation file; DVC will also take care of unprotecting the affected outputs and updating the DVC cache, so you just have to push your changes:
dvc push
and update the changes in git:
git add "dvc.yaml" "dvc.lock" "configs/pipeline.yml" "src"
git commit -m "Updated {{cookiecutter.dataset_slug}}"
git push