How we actually score an AI girlfriend app.
no vibes, just receipts.
Every score on this site comes from the same place: a 24-data-point rubric, three human reviewers, a fresh paid account, and at least 30 days of real usage. No sponsored rankings, no press comps, no guessing. Here's the whole machine, opened up.
📊 6 weighted factors
🧪 24 sub-criteria
👥 3 scorers per app
💳 $0 from vendors
Six factors. 24 data points.
earns its weight.
Each factor breaks into sub-criteria scored out of a fixed point pool. We weight them by what actually changes the day-to-day experience — conversation and memory carry the most because that's what you live in.
Conversation quality
The heart of the score. Can she actually hold a conversation that feels human?
25%
of score
5
PTS
Personality consistency
Does she stay in character across a 2-hour session?
5
PTS
Context retention
Does she track what was said 40 messages ago?
3
PTS
Response latency
Median reply time across 200 logged messages.
4
PTS
Emotional range
Flirty, sad, playful, jealous — does it land?
3
PTS
Initiative
Does she ask questions and start topics, or just react?
Image generation
Selfies, scenes, NSFW range. Quality, speed, and how often it breaks.
20%
of score
5
PTS
Photorealism
Hands, faces, anatomy — the usual AI tells.
5
PTS
Character consistency
Does she look like the same person each gen?
3
PTS
Generation speed
Seconds from prompt to image, averaged.
4
PTS
NSFW range & limits
What's allowed, what's blocked, how gracefully.
3
PTS
Prompt accuracy
Did we get what we asked for?
Conversation quality
The heart of the score. Can she actually hold a conversation that feels human?
25%
of score
6
PTS
Long-term recall
Does she reference past chats unprompted?
5
PTS
Cross-session callbacks
Does she reference past chats unprompted?
5
PTS
Relationship growth
Does the dynamic actually evolve over weeks?
4
PTS
Correction handling
Tell her she's wrong — does it stick?
Voice & video
Voice notes, calls, lipsync. The features that make it feel real — or uncanny.
15%
of score
6
PTS
Voice naturalness
Prosody, breath, emotion — or robotic?
4
PTS
Latency
Real-time call delay, measured.
5
PTS
Video lipsync
If offered — accuracy and frame rate.
5
PTS
Voice variety
How many distinct voices, accents, tones.
Customization
Can you build the girlfriend you actually want, or pick from a short menu?
10%
of score
4
PTS
Appearance control
Hair, eyes, body, style — granularity.
4
PTS
Personality design
Traits, backstory, speech patterns.
2
PTS
Scenario depth
Custom worlds, roleplay setups, lore.
Value for money
Price vs what you actually get. Free-tier honesty and refund behavior count.
10%
of score
4
PTS
Feature parity vs price
Are you paying for vapor or substance?
3
PTS
Free-tier generosity
Is the free plan usable or a trap?
3
PTS
Refund & cancel policy
We actually request refunds and time them.
How a review actually happens.
From sign-up to published score, every app runs the same gauntlet. No shortcuts, even for the ones we expect to love.
01
Fresh, paid account
Every app gets a brand-new account we pay for with our own money — never a press comp. We buy the mid-tier plan most users land on.
02
30 days minimum
No app is scored on a weekend. We log a minimum of 30 days and 200+ messages, across mornings, late nights, good moods and bad.
03
Scripted + organic tests
Half our testing follows an identical script every app gets (planted memories, set prompts). Half is organic, the way a real user actually talks.
04
Three editors, one rubric
At least three reviewers score independently on the 24-point rubric. Scores within 1.0 are averaged; bigger gaps trigger a re-test and a debate.
05
Screenshots or it didn't happen
Every claim in a review is backed by a logged screenshot or recording. Our test logs are available on request.
06
Re-test on every change
Top 3 apps re-tested monthly, top 10 quarterly, and any app within 7 days of a major changelog or price change. Scores are living, not frozen.
What a number means.
A 9.6 and an 8.1 aren't close. Here's the scale we hold every app to — and a real worked example of how the weighting produces a final score.
9.0 – 10
Exceptional
Category-defining. We'd recommend it to a friend without caveats.
8.0 – 8.9
Great
Excellent with one or two known trade-offs. A safe pick.
7.0 – 7.9
Good
Solid, but you'll feel the limits. Right for specific needs.
5.0 – 6.9
Mixed
Real flaws. Only if it nails the one thing you care about.
Below 5.0
Avoid
Broken, predatory pricing, or privacy red flags. We name names.

Candy AI
★ BEST OVERALL
9.6 / 10
Conversation quality
9.7 x 25%
+2.42
Image generation
9.8 x 20%
+1.96
Memory
9.5 x 25%
+1.90
Voice & video
9.4 x 15%
+1.41
Customization
9.6 x 10%
+0.96
Value
9.5 x 10%
+0.95
Weighted total
9.6 / 10
Rounded to one decimal. A 0.3+ swing between our three scorers triggers a re-test before anything publishes.
Nobody can buy a ranking here.
A 9.6 and an 8.1 aren't close. Here's the scale we hold every app to — and a real worked example of how the weighting produces a final score.
🚫
No paid placement
No app has ever paid to rank higher, appear first, or soften a review. Offers to do so are declined in writing — and occasionally screenshotted.
💳
We pay full price
Every account is bought with our own money at the price you'd pay. We never accept free premium tiers or press comps that could bias a test.
🧱
Reviews vs. revenue, walled
The editor scoring an app does not see its affiliate rate. Commission has zero input into the rubric. Ever.
The method evolves.
The category moves fast, so the rubric does too. Every change is dated and logged — old scores are recalculated when weights shift.
v4.2
Mar 2026
v4.0
Nov 2025
v3.1
Jun 2025
v3.0
Jan 2025
See the rubric in action.
Now that you know how the sausage is made — go read the rankings it produced.

