So, I’m self-hosting Immich. The issue is that we tend to take a lot of pictures of the same scene or thing so we can pick the best one later, which leaves us with 5 to 10 photos that are basically duplicates, but not quite.
Some duplicate-finding programs put those images at 95% or more similarity.

I’m wondering if there’s any way, probably at the file system level, for these near-identical images to be compressed together.
Maybe deduplication?
Have any of you guys handled a similar situation?

  • simplymath@lemmy.world
    4 months ago

    Compressed length is already known to be a powerful metric for classification tasks, but requires polynomial time to do the classification. As much as I hate to admit it, you’re better off using a neural network because they work in linear time, or figuring out how to apply the kernel trick to the metric outlined in this paper.

    A formal paper on using compression length as a measure of similarity: https://arxiv.org/pdf/cs/0111054

    A blog post on this topic, applied to image classification:

    https://jakobs.dev/solving-mnist-with-gzip/
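To make the compression-length idea concrete, here is a minimal sketch of a normalized compression distance, in the spirit of the paper linked above. It is a toy under stated assumptions, not the paper's exact method: `zlib` is swapped in as the compressor, and the function names `compressed_len` and `ncd` are made up for illustration. Values near 0 mean "very similar", values near 1 mean "unrelated".

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed input (a stand-in for 'information content')."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: zlib is the compressor. Its 32 KiB window means this toy
    only notices shared content in small inputs; it is not suitable for
    multi-megabyte photos as-is.
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # compressing both together reuses shared content
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that running this on JPEG files directly would probably tell you very little, since already-compressed data looks close to random. You would most likely have to compare decoded (and downscaled) pixel data instead.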

    • smpl@discuss.tchncs.de
      4 months ago

      I was not talking about classification. What I was talking about was a simple probe of how well a collage of similar images compares in compressed size to the same images compressed individually. The hypothesis is that a compression codec would compress images with a similar color distribution better in a spritesheet than if it encoded each image individually. I don’t know, the savings might be negligible, but I’d assume there is something to gain, at least for some compression codecs. I doubt doing deduplication post-compression has much to gain.

      I think you’re overthinking the classification task. These images are very similar and I think comparing the color distribution would be adequate. It would of course be interesting to compare the different methods :)
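A rough sketch of what "comparing the color distribution" could look like: a coarse RGB histogram per image and a histogram-intersection score between two of them. Everything here is an assumption for illustration. The function names are made up, the bin count is arbitrary, and pixels are taken as plain `(r, g, b)` tuples, which in practice you would get by decoding the photo with an imaging library.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Normalized coarse RGB histogram from an iterable of (r, g, b) tuples (0-255)."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms: 1.0 is identical, 0.0 is disjoint."""
    return sum(min(share, h2.get(bucket, 0.0)) for bucket, share in h1.items())


# Hypothetical toy data: two nearly identical "sky" shots and one unrelated red image.
sky = [(10, 20, 200)] * 90 + [(250, 250, 250)] * 10
sky_again = [(12, 22, 198)] * 88 + [(250, 250, 250)] * 12
red = [(200, 30, 30)] * 100

similar = histogram_intersection(color_histogram(sky), color_histogram(sky_again))
different = histogram_intersection(color_histogram(sky), color_histogram(red))
```

A histogram ignores where the colors sit in the frame, so two different scenes with the same palette would score high. For bursts of the same scene that is probably acceptable.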

      • smpl@discuss.tchncs.de
        4 months ago

        Wait… this is exactly the problem a video codec solves. Scoot and give me some sample data!
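For anyone who wants to try the video-codec idea, here is a hedged sketch that prepares an `ffmpeg` run treating one burst of photos as the frames of a short clip. The helper names, the file names, and the choice of `libx265` with `crf=18` are assumptions picked for illustration. The sketch only builds the concat list text and the argument list, it does not run anything.

```python
from pathlib import Path


def concat_list_text(images):
    """Text for ffmpeg's concat demuxer: one "file '<path>'" line per image.

    Assumption: paths contain no single quotes (those would need escaping).
    """
    return "".join(f"file '{Path(img).as_posix()}'\n" for img in images)


def ffmpeg_command(list_path, output_path, crf=18):
    """Argument list that encodes the listed images as frames of one video.

    Assumptions: ffmpeg is installed with libx265, and all images in the
    burst share the same dimensions.
    """
    return [
        "ffmpeg",
        "-f", "concat",   # read inputs from a list file
        "-safe", "0",     # allow arbitrary paths in the list
        "-i", str(list_path),
        "-c:v", "libx265",
        "-crf", str(crf),
        str(output_path),
    ]


# Hypothetical burst of three shots.
burst = ["IMG_0001.jpg", "IMG_0002.jpg", "IMG_0003.jpg"]
list_text = concat_list_text(burst)
cmd = ffmpeg_command("burst.txt", "burst.mp4")
```

Keep in mind this re-encodes the photos lossily and drops per-image metadata such as EXIF, so it is an experiment for measuring savings rather than a replacement for the originals.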

        • simplymath@lemmy.world
          4 months ago

          Yeah. That’s what the video codec inside an MP4 does, but I was just saying that first you have to figure out which images are “close enough” to encode this way.
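The "close enough" step could be sketched as a simple greedy grouping: walk through the photos in capture order and start a new group whenever the next photo is not similar enough to the previous one. This is only one possible approach. The function name `group_bursts` is made up, the `similarity` callback is whatever measure you pick, and the 0.95 default just echoes the 95% figure from the original post.

```python
def group_bursts(items, similarity, threshold=0.95):
    """Split an ordered sequence into runs of consecutive near-duplicates.

    Assumptions: items are sorted by capture time, and similarity(a, b)
    returns a score in [0, 1] where higher means more alike.
    """
    groups = []
    for item in items:
        # Compare only against the last member of the current group.
        if groups and similarity(groups[-1][-1], item) >= threshold:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


# Toy example: numbers stand in for photos, closeness stands in for similarity.
def toy_similarity(a, b):
    return 1 - abs(a - b) / 100


bursts = group_bursts([10, 11, 12, 50, 51, 90], toy_similarity)
```

Comparing only neighbours keeps this linear in the number of photos, at the cost of missing near-duplicates that were taken far apart in time.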