
4. Duplicate detection

As an extension of the data exploration stage, a duplicate image detector has been added. Like the other tools, it is run through its own script, but it consists of two independent phases that can be executed separately if desired. In the first phase, given an annotation file, the similarity of every pair of images is calculated in the range [0, 1], where 0 means two completely opposite images and 1 means two identical, and therefore duplicate, images. The output of this first phase is an h5 file containing the similarity matrix of the images. The second phase takes this h5 file as input, generates the duplicate visualization tool, and saves it in the specified target folder. In this second phase the user may also supply a pkl file containing the predictions of another model for the files listed in the annotation file. These predictions and the files in the annotation file must be sorted identically. The tool then shows not only the similarity but also the score given by that model.
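The first phase can be illustrated with a minimal sketch. This section does not state which similarity metric the tool uses, so the code below assumes cosine similarity over one feature vector per image, rescaled from [-1, 1] to [0, 1]. The function names `similarity_matrix` and `attach_scores`, the feature vectors, and the file names are all hypothetical, and writing the matrix to an h5 file is left out.

```python
import numpy as np


def similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise similarity in [0, 1] from one feature vector per image.

    Assumption: cosine similarity rescaled from [-1, 1] to [0, 1]; the
    tool's actual metric is not specified in this section.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    cosine = np.clip(unit @ unit.T, -1.0, 1.0)
    return (cosine + 1.0) / 2.0


def attach_scores(files, predictions):
    """Pair each annotated file with its prediction score by position.

    This only works if both sequences are sorted identically.
    """
    if len(files) != len(predictions):
        raise ValueError("predictions must match the annotation file in length and order")
    return list(zip(files, predictions))


# Hypothetical data: the first two images are identical, the third is opposite.
features = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
sim = similarity_matrix(features)
scored = attach_scores(["a.jpg", "b.jpg", "c.jpg"], [0.9, 0.8, 0.1])
```

Because the scores are matched to files purely by position, a prediction list in a different order would silently attach the wrong score to each image, which is why the two inputs must be sorted identically.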

The interface of this tool is very simple: it has three sliders and two buttons. The first slider sets the similarity threshold from which two images are considered duplicates. The second slider limits the number of images rendered simultaneously, because with very large datasets displaying the duplicates can become slow and costly. The last slider limits the number of images shown per row for each image, which keeps the visualization readable when there are many images. The first button changes the order of the images: 'Descending order' sorts them from most similar to least similar, and 'Ascending order' sorts them from least similar to most similar. Finally, the 'Show only duplicates' button hides the images that have no duplicates at the given threshold.
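The effect of the threshold, ordering, render limit, and 'Show only duplicates' controls can be sketched as a filter over the similarity matrix. This is an illustration only, not the tool's implementation: `find_duplicates` and its parameters are hypothetical names, the threshold is assumed to be inclusive, and the ordering is assumed to apply to the matches of each image.

```python
import numpy as np


def find_duplicates(sim, threshold, descending=True,
                    only_duplicates=True, max_images=None):
    """Return [(image_index, [(other_index, similarity), ...]), ...].

    All parameter names are hypothetical stand-ins for the UI controls.
    """
    rows = []
    for i in range(sim.shape[0]):
        # Keep every other image whose similarity reaches the threshold.
        matches = [(j, float(sim[i, j])) for j in range(sim.shape[1])
                   if j != i and sim[i, j] >= threshold]
        # 'Descending order' puts the most similar match first.
        matches.sort(key=lambda m: m[1], reverse=descending)
        # 'Show only duplicates' drops images without any match.
        if matches or not only_duplicates:
            rows.append((i, matches))
    # Limit how many images are rendered at once.
    return rows if max_images is None else rows[:max_images]


# Hypothetical similarity matrix for three images.
sim = np.array([[1.00, 0.95, 0.20],
                [0.95, 1.00, 0.30],
                [0.20, 0.30, 1.00]])
dupes = find_duplicates(sim, threshold=0.9)
everything = find_duplicates(sim, threshold=0.9, only_duplicates=False)
```

With a threshold of 0.9, images 0 and 1 are reported as duplicates of each other, while image 2 appears only when `only_duplicates` is disabled.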

If the tool is used locally, simply open the generated HTML file. If it is used on a server (e.g. GEA1), you must install the Live Server extension for Visual Studio Code. Once it is installed, generate the HTML, open it in Visual Studio Code, and click the Go Live button at the bottom right of the Visual Studio Code interface to access the visualization tool.
