The Pose Problem: Why Text Can't Describe What Your Eyes See in a Second
You're on the phone with a friend. You're trying to describe a pose. It goes badly.


"Okay, so she's standing with her weight on her left leg — no, her left, your right if you're facing her — and her right knee is bent slightly, maybe fifteen degrees? Her left hand is resting on her hip, but not like a teapot, more like she just placed it there casually. Her chin is tilted down about ten degrees, eyes looking up through the lashes, and her right arm is extended outward at roughly forty-five degrees from her torso, fingers slightly spread but relaxed, not jazz hands..."

You've been talking for thirty seconds. Your friend has no idea what this looks like. Neither do you, anymore.

This is the fundamental absurdity at the heart of AI image generation, and almost nobody talks about it. We obsess over model architectures, sampling methods, CFG scales. We debate whether FLUX or Stable Diffusion renders skin texture better. But the most basic challenge — telling the machine how a human body should be arranged in space — remains almost comically unsolved.

I've spent years working with reference photography and AI generation, and this gap between what we see and what we can say still fascinates me. It's not a technical problem. It's a linguistic one. Maybe even a philosophical one.


The Architecture of a Glance

Here's what happens when you look at a photograph of someone posing: in roughly 200 milliseconds, your visual cortex processes the entire spatial arrangement. The angle of the spine. The weight distribution. The relationship between the hands and the face. The twist of the torso relative to the hips. The micro-tension in the jaw.

All of it. A fifth of a second.

Now try to encode that into language.

Language is sequential. It's one word after another, like walking down a corridor. But a pose is simultaneous — everything happens at once, in three dimensions, with dozens of joints and angles and subtle weight shifts interacting. Writing a prompt for a pose is like trying to describe architecture by walking through a building with your eyes closed, narrating each step. You can do it. It takes forever. And the person listening will never build the same building in their head.

This isn't a new problem, actually. Musicians have dealt with a version of it for centuries. Try describing a melody using only words — not musical notation, not humming, just words. "It goes up a bit, then stays level, then drops suddenly, then there's this little ornamental flutter..." A trained musician might eventually reconstruct something from that. Everyone else will get noise.

Musical notation was invented precisely because verbal description fails for temporal, multi-dimensional information. Poses need their visual equivalent. And until recently, in the AI world, we didn't have one.


Why "Hands on Hips" Means Fifty Different Things

Type "a woman standing with her hands on hips" into any AI generator. Run it ten times.

You'll get ten different images. In some, the hands grip the waist firmly, elbows out wide — a power pose. In others, the fingertips barely rest on the hip bones, casual and relaxed. Some show the hands placed high, near the ribs. Others low, near the belt line. The body might face the camera straight-on, or at a three-quarter angle, or almost in profile.

This happens because "hands on hips" isn't an instruction. It's a category.

Think of it like a recipe that says "add some spice." A chef in Oaxaca and a chef in Kyoto will reach for very different jars. The instruction is technically clear — add spice — but the execution space is enormous. "Hands on hips" contains multitudes. The AI has seen thousands of images tagged with that description, and each one is slightly different. When you prompt it, you're essentially telling the model: "Pick any image from this enormous bucket." It picks. You get surprised.

The more specific your pose description, the fewer reference images the AI has to draw from — and paradoxically, the worse the results often get. Describe something very particular ("right hand resting on left shoulder, left arm hanging straight down, head tilted 20 degrees right, weight on the back foot") and the model starts struggling. It may never have seen that exact configuration labeled with those exact words. So it hallucinates something close. Close, but wrong.


A Small Experiment

I want you to try something right now.

Look at how you're sitting. Don't adjust anything — just notice your actual position. Now try to write a prompt that would generate this exact pose in an AI image.

Go ahead. I'll wait.

If you're like most people, you'll need at least four or five sentences. You'll mention which leg is crossed over which, how your back curves, where your hands are, what your head is doing. You might get stuck on details — is your left foot tucked under the chair or resting flat? Are your shoulders level or is one slightly higher from leaning on an armrest?

Now multiply that effort by five hundred. That's what it would take to build a comprehensive pose library through text descriptions alone. Five hundred distinct body positions, each requiring a paragraph of increasingly desperate anatomical choreography.

This is why I stopped trying.


What Photographers Figured Out a Long Time Ago

Here's the thing that amuses me about the AI community's "discovery" of reference images: professional photographers have been doing this since the invention of the camera.

Every working portrait photographer I know has a collection. Tear sheets from magazines. Screenshots from films. Pages bookmarked in posing guides. Pinterest boards so extensive they could fill a gallery. Before a shoot, you pull references. You show the model a picture on your phone: "Something like this." Three seconds. No verbal description needed. The model sees it, and their body understands it in a way that thirty seconds of explanation never achieves.

This workflow — visual reference in, visual result out — is how the photography industry has operated for decades. It's efficient because it respects the nature of spatial information. You don't translate visual data into words and then back into visual data. You keep it visual the whole way through.

AI image generation is only now catching up to this obvious insight. Tools like image-to-image, ControlNet, pose estimation — these are all attempts to bypass the bottleneck of language. To let you show the machine what you mean instead of telling it.
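
To make the "show, don't tell" part concrete, here is a minimal sketch of the pose-estimation step, assuming the open-source controlnet_aux package (a common companion to ControlNet workflows). The file names are placeholders:

```python
# Extract a pose skeleton from a reference photograph.
# Assumes: pip install controlnet_aux pillow
from controlnet_aux import OpenposeDetector
from PIL import Image

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

reference = Image.open("reference_pose.jpg")   # placeholder: any clear reference photo
skeleton = detector(reference)                 # returns a stick-figure pose image

skeleton.save("pose_skeleton.png")             # this image, not words, is what the model reads
```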

But even with these tools, there's a missing piece.


The Space Between Showing and Telling

Pure reference images have their own limitation: they're too specific. If I feed ControlNet a photo of a model in a particular pose, I'll get that exact pose — same proportions, same angle, same everything. Sometimes that's what you want. But usually, you want the feeling of a pose, not its precise geometry.

What actually works — what I've found after hundreds of hours of iteration — is a combination. A reference image to establish the spatial architecture, paired with language that adjusts the details. The image says "the body is arranged roughly like this." The text says "but make it more relaxed, shift the weight forward, open the hands."

It's like how an architect works with both blueprints and written specifications. The blueprint captures spatial relationships that words can't. The specifications capture intent, materials, and tolerances that drawings can't. Neither is complete alone. Together, they build a building.

The most effective AI workflows I've seen treat pose information the same way: visual input for spatial arrangement, textual input for mood, energy, and refinement.
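
As a rough sketch of that division of labor, this is what the combined step can look like with Hugging Face's diffusers library and an OpenPose ControlNet. The model identifiers and file paths are examples rather than a prescription, and the conditioning scale is the knob that decides how literally the reference is followed:

```python
# Generate from a pose skeleton plus a refining prompt.
# Assumes: pip install diffusers transformers accelerate torch
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example base model; swap in your own checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The skeleton carries the spatial arrangement; the prompt carries mood,
# energy, and the small adjustments a skeleton can't express.
skeleton = load_image("pose_skeleton.png")  # placeholder path
prompt = ("portrait of a woman, relaxed stance, weight shifted slightly forward, "
          "open hands, soft natural light")

result = pipe(
    prompt,
    image=skeleton,
    num_inference_steps=30,
    guidance_scale=7.0,
    # Below 1.0, the pose becomes a strong suggestion rather than a rigid
    # template, which leaves the prompt room to adjust the details.
    controlnet_conditioning_scale=0.7,
).images[0]
result.save("posed_portrait.png")
```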

But this raises a practical question. Where do you get five hundred reference poses, each photographed clearly enough for AI extraction, each different enough to be worth having?


Building the Bridge

This is the problem I kept running into. I had the technical workflow figured out — reference image plus targeted prompt equals consistent results. But sourcing the references was a project in itself. Stock photos are inconsistent. Screenshots from films are copyrighted and poorly lit. My own photo library skewed toward the same handful of poses I always default to.

So I built what I wished existed: a library of 500 distinct poses, each photographed as a clean reference, each accompanied by a prompt that actually describes what's happening in the image. Not a vague category label like "confident stance" but a specific anatomical description that, combined with the visual, gives AI generators enough information to produce something predictable.

It took months. It forced me to develop a vocabulary for body positions that I didn't previously have — a kind of pidgin language somewhere between dance notation and anatomical terminology. "Contralateral weight shift with forward hip displacement" doesn't roll off the tongue, but paired with a photograph, it generates remarkably consistent results across different AI platforms.
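
Purely as an illustration of that pairing (a hypothetical structure, not the actual format of the guide), a library entry only needs two things: the reference image and the anatomical description written while looking at it:

```python
# Hypothetical pose-library entry: a reference image paired with the
# anatomical prompt written while looking at that exact photograph.
pose_library = {
    "pose_0142": {
        "reference": "poses/0142_contrapposto.jpg",  # placeholder path
        "prompt": (
            "standing, contralateral weight shift onto the left leg, "
            "forward hip displacement, right knee softly bent, "
            "left hand resting lightly on the hip, chin tilted down"
        ),
    },
}
```

The point is the pairing itself: either half alone is ambiguous, but together they pin the result down.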

The 500 AI Poses Guide is that library. And the reason it works isn't because the prompts are magic — it's because the prompts were written while looking at the specific photograph, which is the only way to close the gap between visual intent and verbal description.


The Gap Remains

I don't want to oversell this. The fundamental problem hasn't gone away. Language will never fully capture what your eyes process in 200 milliseconds. We're still working with an imperfect bridge between two incompatible formats — the spatial and the sequential, the simultaneous and the linear.

But you can make the bridge shorter.

Every time you pair a visual reference with a well-written prompt, you're compressing the translation distance. Every time you learn to describe a body position with anatomical precision instead of vibes, you're removing ambiguity that the model would otherwise fill with randomness.

The photographers who came before us understood this intuitively. They never tried to describe a pose when they could simply show it. They kept mood boards and clipping files and dog-eared pages in posing manuals. They knew that some information refuses to be flattened into words.

We're working with the same raw materials — light, bodies, composition — just with a different tool holding the camera. The pose problem is old. The tools are new. The solution, it turns out, is the same one photographers have been using all along.

You show. Then you tell. In that order.

Ready to Create Better AI Content?

Get professional prompt guides with reference photos — stop guessing, start creating.

Browse Guides
