Generative art models “remember” certain images, posing a risk to privacy


A bias already identified by researchers in GPT-2.

© Getty Images / Yuichiro Chino

Are image-generating artificial intelligences a privacy risk? The results of a scientific paper, reported by Gizmodo, raise the question.

A bias common to all models

A group of scientists from DeepMind, UC Berkeley, Princeton, and ETH Zurich succeeded in generating synthetic images nearly identical to ones the model had seen during its learning phase. As a reminder, generative artificial intelligences such as DALL-E or Stable Diffusion are trained on databases of several thousand images, a process known as deep learning. As part of their demonstration, the researchers notably managed to recover an original image of Anne Graham Lotz, an American Protestant evangelist, that was part of the training data.

On the right, the original image “learned” by the AI; on the left, the one ultimately generated.

© Screenshot

To obtain these near-original photos memorized by the AI, the researchers repeatedly asked the software to create an image from the same prompt. They then checked whether the results appeared in the AI's training database. Of approximately 350,000 images generated, 94 direct matches and 109 near matches were identified, a memorization rate of about 0.03% (94 direct matches out of roughly 350,000 generations), very low compared with the full set of stored images. All diffusion models suffer from the same problem, to a greater or lesser degree.
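To make the procedure concrete, here is a minimal Python sketch of that repeated-sampling test. It is not the authors' code: `generate(prompt)` stands in for any text-to-image model, `training_images` is assumed to hold the candidate training images as arrays, and a simple pixel-space distance with a hypothetical `threshold` replaces the more robust similarity measure used in the study.

```python
# Minimal sketch of the extraction test described above, NOT the paper's
# actual implementation. `generate` is any callable that returns one
# synthetic image (as a float array in [0, 1]) per call.

import numpy as np

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square pixel distance between two same-shaped images."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def extraction_test(generate, prompt, training_images,
                    n_samples=500, threshold=0.05):
    """Prompt the model repeatedly with the same caption and count
    generations that land unusually close to a known training image."""
    matches = []
    for i in range(n_samples):
        sample = generate(prompt)
        # Compare against every candidate training image for this caption.
        best = min(l2_distance(sample, t) for t in training_images)
        if best < threshold:  # suspiciously close -> likely memorized
            matches.append((i, best))
    return matches
```

At the scale reported in the article, this kind of counting yields the quoted figure: 94 direct matches across roughly 350,000 generations is about 0.03%.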

The risk of medical data

Even though the reproduction rate is relatively low for now, the scientists fear that as models grow, more of the learned information will be regenerated in raw form. “Maybe next year, the new model that comes out will be much bigger and much more powerful, and these memorization risks will be much higher than today,” says Vikash Sehwag, a doctoral candidate at Princeton University who took part in the study, quoted by Gizmodo.

An almost identical reproduction of the data stored by the AI.

© Screenshot

Eric Wallace, a doctoral student at UC Berkeley, points to the potentially harmful consequences of this bias if AI were used to produce synthetic medical data from X-rays. Could the patients' original scans be recovered? “It’s pretty rare, so you might not notice it happening at first, and then you might actually deploy that dataset to the web,” warns the scientist, who notes that the objective of this research is “to anticipate these types of errors.”
