Local and secure image recognition is fairly trivial in terms of power consumption, but hey, there's likely going to be some option to turn it off, just like hardware acceleration for video and image rendering, which uses the same GPU in similar ways. The power consumption argument is not invalid, but the way people deploy it is baffling to me, and it's often based on worst-case estimates that are unrealistic by design.
To be clear, Apple is now building CPUs into iPads that can parse these queries in seconds while running at a few tens of watts. Each time I boot up Tekken on my 1000W gaming PC for five minutes I'm burning more energy than my share of AI queries for weeks, if not months.
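If you want to sanity-check that comparison yourself, here's a rough back-of-the-envelope sketch. The only figures taken from above are the 1000W and the five minutes; the per-query energy (`ASSUMED_WH_PER_QUERY`) is a hypothetical placeholder, not a measured number, so swap in whatever estimate you trust. Also note that 1000W is a PSU rating, so this treats it as an upper bound on actual draw.

```python
def session_energy_wh(power_watts: float, minutes: float) -> float:
    """Energy in watt-hours for a device drawing power_watts for the given minutes."""
    return power_watts * minutes / 60.0


def equivalent_queries(session_wh: float, wh_per_query: float) -> float:
    """How many queries fit in the same energy budget as one session."""
    if wh_per_query <= 0:
        raise ValueError("wh_per_query must be positive")
    return session_wh / wh_per_query


PC_WATTS = 1000.0           # from the comment; PSU rating, so an upper bound
SESSION_MINUTES = 5.0       # from the comment
ASSUMED_WH_PER_QUERY = 0.3  # ASSUMPTION: hypothetical placeholder, not a measurement

gaming_wh = session_energy_wh(PC_WATTS, SESSION_MINUTES)
queries = equivalent_queries(gaming_wh, ASSUMED_WH_PER_QUERY)

print(f"Five minutes of gaming: {gaming_wh:.1f} Wh")
print(f"Equivalent queries at the assumed rate: {queries:.0f}")
```

Whether that works out to weeks or months obviously depends on the per-query figure and on how many queries you actually make in a day, which is why both are left as inputs.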
On the second point I absolutely disagree. There is no practical advantage to making accessibility annoying to implement. Accessibility should be structural, mandatory and automatic, not a nice thing people do for you. Eff that.
As for the third part, none of the alt text I've seen deployed adds much of value beyond a description of the content. What is measurable and factual is that the coverage of alt-text, even in places where it's disproportionately popular like Mastodon, is spotty at best and negligible at worst. There is no question that automated alt-text is better than no alt-text, and most content has no alt-text.
That is only the tip of the iceberg for ML applied to accessibility, too. You could do active queries, you could let users ask for additional context or clarification, you could have much smoother, automated voice reading of text, including visual description on demand... This tech is powerful in many areas, and this is clearly one. In fact, this is a much better application than search, by a lot. It's frustrating that search and factual queries, where this stuff is pretty unreliable, are the thing everybody is thinking about.