How we score · rubric v4.2

How we actually score an AI girlfriend app.
No vibes, just receipts.

Every score on this site comes from the same place: a 24-data-point rubric, three human reviewers, a fresh paid account, and at least 30 days of real usage. No sponsored rankings, no press comps, no guessing. Here's the whole machine, opened up.

📊 6 weighted factors

🧪 24 sub-criteria

👥 3 scorers per app

💳 $0 from vendors

24 data points across six weighted factors: Conversation 25%, Image generation 20%, Memory 20%, Voice & video 15%, Customization 10%, Value 10%.


The rubric

Six factors. 24 data points.
Each one earns its weight.

Each factor breaks into sub-criteria scored out of a fixed point pool. We weight them by what actually changes the day-to-day experience — conversation and memory carry the most because that's what you live in.

Conversation quality

The heart of the score. Can she actually hold a conversation that feels human?

25% of score

Personality consistency (5 pts): Does she stay in character across a 2-hour session?

Context retention (5 pts): Does she track what was said 40 messages ago?

Response latency (3 pts): Median reply time across 200 logged messages.

Emotional range (4 pts): Flirty, sad, playful, jealous. Does it land?

Initiative (3 pts): Does she ask questions and start topics, or just react?

Image generation

Selfies, scenes, NSFW range. Quality, speed, and how often it breaks.

20% of score

Photorealism (5 pts): Hands, faces, anatomy (the usual AI tells).

Character consistency (5 pts): Does she look like the same person each gen?

Generation speed (3 pts): Seconds from prompt to image, averaged.

NSFW range & limits (4 pts): What's allowed, what's blocked, how gracefully.

Prompt accuracy (3 pts): Did we get what we asked for?

Memory

Long-term recall, callbacks across sessions, and whether corrections stick.

20% of score

Long-term recall (6 pts): Does she reference past chats unprompted?

Cross-session callbacks (5 pts): Does she reference past chats unprompted?

Relationship growth (5 pts): Does the dynamic actually evolve over weeks?

Correction handling (4 pts): Tell her she's wrong. Does it stick?

Voice & video

Voice notes, calls, lipsync. The features that make it feel real — or uncanny.

15% of score

Voice naturalness (6 pts): Prosody, breath, emotion, or robotic?

Latency (4 pts): Real-time call delay, measured.

Video lipsync (5 pts): If offered, accuracy and frame rate.

Voice variety (5 pts): How many distinct voices, accents, tones.

Customization

Can you build the girlfriend you actually want, or pick from a short menu?

10% of score

Appearance control (4 pts): Granularity of hair, eyes, body, and style options.

Personality design (4 pts): Traits, backstory, speech patterns.

Scenario depth (2 pts): Custom worlds, roleplay setups, lore.

Value for money

Price vs what you actually get. Free-tier honesty and refund behavior count.

10% of score

Feature parity vs price (4 pts): Are you paying for vapor or substance?

Free-tier generosity (3 pts): Is the free plan usable or a trap?

Refund & cancel policy (3 pts): We actually request refunds and time them.
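The rubric lists point pools but does not spell out how points become a factor's 0-10 score. As a minimal sketch, assuming plain linear scaling (points earned divided by the pool, times 10), the Conversation quality pool could be scored like this. The function name and the earned points in the example are hypothetical and not taken from any real review.

```python
# Point pool for Conversation quality, as listed in the rubric above.
CONVERSATION_POOL = {
    "Personality consistency": 5,
    "Context retention": 5,
    "Response latency": 3,
    "Emotional range": 4,
    "Initiative": 3,
}


def factor_score(earned: dict, pool: dict) -> float:
    """Scale earned points to a 0-10 factor score.

    ASSUMPTION: linear scaling. The page does not state the conversion.
    """
    if set(earned) != set(pool):
        raise ValueError("earned points must cover exactly the pool's sub-criteria")
    for name, pts in earned.items():
        if not 0 <= pts <= pool[name]:
            raise ValueError(f"{name}: {pts} is outside 0..{pool[name]}")
    return 10 * sum(earned.values()) / sum(pool.values())


# Hypothetical reviewer sheet: full marks except half a point off context retention.
example_sheet = {
    "Personality consistency": 5,
    "Context retention": 4.5,
    "Response latency": 3,
    "Emotional range": 4,
    "Initiative": 3,
}

conversation_score = factor_score(example_sheet, CONVERSATION_POOL)  # 19.5 of 20 points
```

The same function works for the other five factors by swapping in their pools; the six pools add up to 100 points across 24 sub-criteria.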

The process

How a review actually happens.

From sign-up to published score, every app runs the same gauntlet. No shortcuts, even for the ones we expect to love.

01

Fresh, paid account

Every app gets a brand-new account we pay for with our own money — never a press comp. We buy the mid-tier plan most users land on.

02

30 days minimum

No app is scored over a single weekend. We log a minimum of 30 days and 200+ messages, across mornings, late nights, good moods and bad.

03

Scripted + organic tests

Half our testing follows an identical script every app gets (planted memories, set prompts). Half is organic, the way a real user actually talks.

04

Three editors, one rubric

At least three reviewers score independently on the 24-data-point rubric. Scores within 1.0 of each other are averaged; bigger gaps trigger a re-test and a debate.

05

Screenshots or it didn't happen

Every claim in a review is backed by a logged screenshot or recording. Our test logs are available on request.

06

Re-test on every change

The top 3 apps are re-tested monthly, the top 10 quarterly, and any app within 7 days of a major changelog or price change. Scores are living, not frozen.
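Two of the steps above are mechanical enough to sketch in code: the reviewer-agreement check from step 04 and the re-test cadence from step 06. This is an illustration under stated assumptions, not the site's actual tooling. The function names are hypothetical, "monthly" and "quarterly" are approximated as 30 and 90 days, and the gap threshold is a parameter because the page mentions a 1.0 gap in step 04 and a 0.3 swing in the note under the worked example.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Optional


def reconcile(scores: list, max_gap: float = 1.0) -> Optional[float]:
    """Average the reviewers' scores, or return None to signal a re-test."""
    if len(scores) < 3:
        raise ValueError("the process calls for at least three reviewers")
    if max(scores) - min(scores) > max_gap:
        return None  # gap too large: re-test and debate before publishing
    return round(mean(scores), 1)


def scheduled_interval(rank: int) -> Optional[timedelta]:
    """Routine re-test interval by ranking position (30/90 days are assumptions)."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    if rank <= 3:
        return timedelta(days=30)
    if rank <= 10:
        return timedelta(days=90)
    return None  # no routine cadence stated for apps outside the top 10


def next_retest(rank: int, last_test: date,
                last_major_change: Optional[date] = None) -> Optional[date]:
    """Earliest date by which the app should be re-tested, if any."""
    deadlines = []
    interval = scheduled_interval(rank)
    if interval is not None:
        deadlines.append(last_test + interval)
    # A major changelog or price change after the last test starts a 7-day clock.
    if last_major_change is not None and last_major_change > last_test:
        deadlines.append(last_major_change + timedelta(days=7))
    return min(deadlines) if deadlines else None
```

For example, a top-3 app last tested on 1 March with a price change on 5 March would be due by 12 March, because the 7-day rule lands before the monthly one.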

Reading the score

What a number means.

A 9.6 and an 8.1 aren't close. Here's the scale we hold every app to — and a real worked example of how the weighting produces a final score.

9.0 – 10 (Exceptional): Category-defining. We'd recommend it to a friend without caveats.

8.0 – 8.9 (Great): Excellent with one or two known trade-offs. A safe pick.

7.0 – 7.9 (Good): Solid, but you'll feel the limits. Right for specific needs.

5.0 – 6.9 (Mixed): Real flaws. Only if it nails the one thing you care about.

Below 5.0 (Avoid): Broken, predatory pricing, or privacy red flags. We name names.
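Read as a lookup, the scale above is a set of floors checked from the top down. Here's a small sketch of that mapping, assuming scores are rounded to one decimal before banding; the `band` function and `BANDS` table are illustrative names, not part of any published tool.

```python
# (floor, label) pairs taken from the scale above, highest band first.
BANDS = [
    (9.0, "Exceptional"),
    (8.0, "Great"),
    (7.0, "Good"),
    (5.0, "Mixed"),
]


def band(score: float) -> str:
    """Return the label for a 0-10 score."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    score = round(score, 1)  # scores are published to one decimal
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Avoid"  # anything below 5.0
```

With this mapping a 9.6 lands in Exceptional and an 8.1 in Great, which is why the two aren't close.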

Candy AI
★ BEST OVERALL

9.6 / 10

Conversation quality: 9.7 x 25% = +2.42

Image generation: 9.8 x 20% = +1.96

Memory: 9.5 x 20% = +1.90

Voice & video: 9.4 x 15% = +1.41

Customization: 9.6 x 10% = +0.96

Value: 9.5 x 10% = +0.95

Weighted total: 9.6 / 10

Rounded to one decimal. A 0.3+ swing between our three scorers triggers a re-test before anything publishes.
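The worked example is a plain weighted sum, so it can be re-checked in a few lines. This sketch uses the six factor weights from the rubric, with Memory at 20% as given in the factor list and the v4.2 changelog entry (which also matches its +1.90 contribution), and the Candy AI factor scores shown above. The function and variable names are assumptions made for illustration.

```python
# Factor weights from rubric v4.2; they must sum to 100%.
WEIGHTS = {
    "Conversation quality": 0.25,
    "Image generation": 0.20,
    "Memory": 0.20,
    "Voice & video": 0.15,
    "Customization": 0.10,
    "Value": 0.10,
}

# Factor scores from the Candy AI worked example.
CANDY_AI = {
    "Conversation quality": 9.7,
    "Image generation": 9.8,
    "Memory": 9.5,
    "Voice & video": 9.4,
    "Customization": 9.6,
    "Value": 9.5,
}


def weighted_total(scores: dict, weights: dict = WEIGHTS) -> float:
    """Sum of factor score times factor weight, unrounded."""
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted factors")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100%")
    return sum(scores[factor] * weights[factor] for factor in weights)


total = weighted_total(CANDY_AI)   # about 9.605 before rounding
published = round(total, 1)        # 9.6, the published score
```

The weight check is there on purpose: if any factor's weight drifts (say Memory is entered as 25%), the weights no longer sum to 100% and the function refuses to produce a score.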

Independence

Nobody can buy a ranking here.


🚫

No paid placement

No app has ever paid to rank higher, appear first, or soften a review. Offers to do so are declined in writing — and occasionally screenshotted.

💳

We pay full price

Every account is bought with our own money at the price you'd pay. We never accept free premium tiers or press comps that could bias a test.

🧱

Reviews vs. revenue, walled

The editor scoring an app does not see its affiliate rate. Commission has zero input into the rubric. Ever.

Rubric changelog

The method evolves.

The category moves fast, so the rubric does too. Every change is dated and logged — old scores are recalculated when weights shift.

v4.2
Mar 2026

Raised weight on Memory (15%→20%) as long-term recall became table stakes. Added refund-timing sub-test to Value.

v4.0
Nov 2025

Added Voice & Video as a standalone factor after 60% of apps shipped voice. Split from Conversation.

v3.1
Jun 2025

Introduced the planted-memory recall test (14-day callback) to standardize Memory scoring.

v3.0
Jan 2025

Moved to the 24-data-point rubric. Mandated 3 independent scorers per app.

See the rubric in action.

Now that you know how the sausage is made — go read the rankings it produced.
