Every Picture Tells a Story: Now More Accurately Than Ever
Many of us rely on the internet for images we need in our daily work or correspondence. But all too often, the search misses the mark, despite continuous advances in artificial intelligence. For instance, you need a photo of a female CEO, but most of the images that pop up are of men. You’d like to playfully send your girlfriend a picture of roses in her favorite color, lavender, but you can only find red.
Better results could be coming your way, thanks to a team of Technion doctoral students and alumni working with Technion Assistant Professor Yonatan Belinkov that is researching methods to quickly fix the problems by weeding out biases and wrong assumptions.
Image generator models are trained on vast amounts of image-text pairs — for example, matching the text “picture of a dog” to a picture of a dog, repeated millions of times. “Since these models are trained on a lot of data from the real world, they acquire and internalize assumptions about the world,” said Hadas Orgad, an Apple AI doctoral fellow and student from the Henry and Marilyn Taub Faculty of Computer Science, and Bahjat Kawar, a Taub Faculty graduate who is now a computer vision researcher at Apple.
“Some of these assumptions are useful, for example, ‘the sky is blue,’ and they allow us to obtain beautiful images even with short and simple descriptions.” But incorrect assumptions and societal biases are also encoded during the AI training process. And as world events change quickly, text-image matches of heads of state or actors portraying important characters, for example, soon become outdated.
Traditional solutions to these problems – such as changing the data constantly or fine-tuning – are expensive and eat up a lot of computer time and energy. And even then, they don’t always come up with the correct answers. “Therefore, we would like a precise method to control the assumptions that the model encodes,” said the researchers.
The team developed a method called TIME (Text-to-Image Model Editing) that does not require fine-tuning, retraining, or changing the language model, but only a partial re-editing of approximately 2% of the model’s parameters. It allows for the efficient correction of biases and assumptions in less than a second. Orgad, working with fellow doctoral student Dana Arad, also created a second method called ReFACT that offers a different algorithm for editing an even smaller percentage of the model’s parameters and achieving more precise results.
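To give a feel for how a model can be corrected almost instantly without retraining, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not the team’s algorithm: the article does not describe the update rule, so the sketch assumes the edit is a regularized least-squares adjustment of a single small weight matrix. The names `matvec`, `edit_projection`, and `lam`, and all the numbers, are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def edit_projection(W, source, target, lam=0.1):
    """Return an edited copy of W so that W @ source moves toward target.

    Assumed objective (not taken from the article):
        minimize ||W_new @ source - target||^2 + lam * ||W_new - W||^2
    For a single source vector this has a closed-form, rank-one solution,
    so no training loop is needed.
    """
    current = matvec(W, source)
    residual = [t - c for t, c in zip(target, current)]
    scale = 1.0 / (lam + sum(s * s for s in source))
    return [
        [w + r * s * scale for w, s in zip(row, source)]
        for row, r in zip(W, residual)
    ]


# Toy 2x3 matrix standing in for one small slice of a model's parameters.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
source = [1.0, 0.0, 0.0]     # hypothetical embedding of the prompt to fix
target = [0.0, 1.0]          # hypothetical output the prompt should produce
unrelated = [0.0, 1.0, 0.0]  # an input orthogonal to the source

W_edited = edit_projection(W, source, target, lam=1e-3)
print(matvec(W_edited, source))     # close to target
print(matvec(W_edited, unrelated))  # same as before the edit
```

In this toy setting, a smaller `lam` pulls the edited output closer to the target, while a larger one keeps the matrix closer to its original values. Inputs orthogonal to the edited one are left untouched, which hints at why a targeted edit can be both fast and narrow.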
TIME was presented in October 2023 at the International Conference on Computer Vision. Ongoing research based on TIME, developed in collaboration with Northeastern University and Massachusetts Institute of Technology, is exploring a way to curb a variety of ethically undesirable behaviors by removing unwanted associations, including offensive content. ReFACT was presented in Mexico at one of the leading conferences in natural language processing research.