18. Quantitative Evaluation

Posted Jun 4, 2025

By Mingyu An

Quantitative vs Qualitative

Quantitative

Objectively measures human performance
Narrower focus
Emphasizes internal validity

Qualitative

Observes and develops an understanding of human experience
Emphasizes preserving the richness of non-numeric data
Emphasizes external validity

Quantitative evaluation is, in a sense, a scientific approach.

It goes beyond a vague notion of "user friendly".

Specify the users and tasks.

What to measure

Time to learn
Speed of performance
Error rate
Retention over time (how well users remember what they first learned)
Subjective satisfaction

Quantitative Evaluation

Methods

User event collection
e.g., mouse clicks…
Controlled experiment
- Testable hypothesis
- Independent / dependent variables
- Can be reproduced (reliability)

User Event Collection example

Analysis of the ratio of undo, erase, and self-report events

→ Looking only at the frequency of undo and erase events is enough to catch most of the self-reports.
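
As a rough illustration of this kind of analysis, the sketch below counts events in a log and checks what fraction of self-reports were preceded by an undo or erase. The log format, the event names, and the 3-second look-back window are all assumptions made for the example, not details from the original study.

```python
from collections import Counter

# Hypothetical event log: (timestamp in seconds, event name).
events = [
    (1.0, "draw"), (2.5, "undo"), (3.0, "draw"), (4.2, "self_report"),
    (6.0, "draw"), (7.1, "erase"), (7.9, "self_report"),
    (9.0, "draw"), (12.0, "draw"), (20.0, "self_report"),
]

WINDOW = 3.0  # assumed look-back window in seconds


def count_events(log):
    """Count how often each event type occurs in the log."""
    return Counter(name for _, name in log)


def self_reports_caught(log, window=WINDOW):
    """Fraction of self-reports preceded by an undo/erase within `window` seconds."""
    correction_times = [t for t, name in log if name in ("undo", "erase")]
    report_times = [t for t, name in log if name == "self_report"]
    if not report_times:
        return 0.0
    caught = sum(
        any(0 <= r - c <= window for c in correction_times)
        for r in report_times
    )
    return caught / len(report_times)


counts = count_events(events)
coverage = self_reports_caught(events)
print(counts["undo"], counts["erase"], counts["self_report"], round(coverage, 2))
```

With this made-up log, two of the three self-reports fall shortly after an undo or erase, so the coverage is 2/3.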

Controlled Experiment

Based on a practical problem and existing theory

Outcome
- Guidance for practitioners
- Refine theory
- Advice for experimenters

How to do it?

State a lucid, testable hypothesis
Identify independent and dependent variables (as well as control and random variables)
Design the experimental protocol
Choose the user population
Apply for human subjects protocol review (IRB review)
Run a couple of pilot tests with users
Run the experiment (Don’t forget to debrief)
Run statistical analysis
Draw conclusions

Note for Running the Experiment

Is it ethical? Is it useful?

Is it reliable?

Is it valid?

Does this experiment consider variation between subjects?
- Enough samples, or blocked?
Was this experiment biased?
Does the experiment reflect a real-world situation?

Independent Variables

The things you manipulate independent of a subject’s behavior.

e.g., device type

Dependent Variables

Variables that depend on a subject’s behavior.
The things you set out to measure quantitatively.

Control Variables

Circumstances not under investigation that are kept constant while testing the effect.
More control → less generalizable → lower external validity.

Random Variables

Circumstances that are allowed to vary randomly.
More variability means lower internal validity, but higher external validity!

⇒ There is a trade-off between control variables (internal validity) and random variables (external validity)!

Confounding Variable

A variable that was not accounted for but affected the experimental results (adversely).
An external factor that can affect the results of an experiment.
A circumstance that varies systematically with the independent variables.
Should be controlled or randomized to avoid misleading results!

→ e.g., GDP - Child’s weight

Correlates with both the dependent and independent variables
Can cause a Type I error (false positive: an effect appears between two variables when in fact there is none)
Hurts internal validity
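
To make the false-positive risk concrete, here is a small simulation sketch. The variables `x`, `y`, and `z`, the noise levels, and the sample size are all hypothetical: `z` plays the confounder that drives both `x` and `y`, while `x` has no direct effect on `y`. The raw correlation looks strong, but it largely disappears once the linear effect of `z` is removed from both.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, zs):
    """Residuals of ys after removing a least-squares linear fit on zs."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z = [rng.gauss(0, 1) for _ in range(2000)]   # hypothetical confounder
x = [zi + rng.gauss(0, 0.5) for zi in z]     # driven by z only
y = [zi + rng.gauss(0, 0.5) for zi in z]     # driven by z only, not by x

raw_r = pearson(x, y)                                    # inflated by z
adjusted_r = pearson(residuals(x, z), residuals(y, z))   # z's effect removed
print(f"raw r = {raw_r:.2f}, adjusted r = {adjusted_r:.2f}")
```

Removing the confounder's effect after the fact, as done here, only works when the confounder was measured; controlling or randomizing it in the design is the safer route.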

Then how to control confounding variables?

Case-control study
- Form a control group to which the case was not applied, and compare against it.
Cohort study
- Select one homogeneous group and follow it continuously over time.
Blocking
- Grouping experimental units that are similar.
Randomization
- Place each test subject into a random group so that the confounding variable itself is spread evenly across groups (requires a large sample).

Blocking

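Group experimental units into blocks based on similarity.
Group by a source of variability that is not of primary interest, then randomly assign the conditions half and half within each group (block).

→ This can reduce selection bias caused by factors such as gender or individual differences.

e.g., testing two types of insoles

Randomization: shuffle the users and assign each insole type to 5 people.
Blocking: treat each user as a block, and assign each insole type to the left or right shoe.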

Best practice

Should randomize in each block
Measure difference within the block
Block what you can; randomize what you cannot
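
The insole example can be sketched as follows. The insole labels `A` and `B`, the ten user names, and both helper functions are hypothetical; the point is only to contrast randomizing within each block against randomizing across users.

```python
import random

INSOLES = ("A", "B")  # hypothetical insole types


def blocked_assignment(users, rng):
    """Each user is a block: both insoles are worn, with the foot chosen at random."""
    plan = {}
    for user in users:
        order = list(INSOLES)
        rng.shuffle(order)  # randomize within the block
        plan[user] = {"left": order[0], "right": order[1]}
    return plan


def randomized_assignment(users, rng):
    """Completely randomized: half the users get insole A, the other half B."""
    shuffled = list(users)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {u: (INSOLES[0] if i < half else INSOLES[1])
            for i, u in enumerate(shuffled)}


users = [f"user{i}" for i in range(1, 11)]
rng = random.Random(42)  # fixed seed so the sketch is repeatable
blocked = blocked_assignment(users, rng)
randomized = randomized_assignment(users, rng)
```

In the blocked plan, wear is compared between the left and right shoe of the same person, so differences between people (weight, activity level) cancel out within each block.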

Design the experimental protocol

Between subjects:

each subject runs one condition

Pros:

Compare observed results between independent groups.
Can eliminate ordering and learning effects.

Cons:

Need more subjects
Differences between subjects might introduce bias.
Might allow confounding variables

Within subjects:

each subject runs all conditions; comparing result within each subject.

Pros:

Need fewer subjects
Can eliminate variation (and confounding variables) due to differences between subjects.

Cons:

Beware of ordering and learning effects
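
The two designs can be contrasted with a short sketch. The subject IDs and the `mouse` / `touchpad` conditions are hypothetical. Rotating the starting condition per subject is a common counterbalancing technique for ordering effects; it is an addition here, not something these notes prescribe. The round-robin assignment in the between-subjects case is a simplification too, since a real study would assign conditions randomly.

```python
def between_subjects(subjects, conditions):
    """Each subject runs exactly one condition (round-robin for simplicity)."""
    k = len(conditions)
    return {s: [conditions[i % k]] for i, s in enumerate(subjects)}


def within_subjects(subjects, conditions):
    """Each subject runs all conditions; the starting condition rotates per subject."""
    k = len(conditions)
    return {
        s: [conditions[(i + j) % k] for j in range(k)]
        for i, s in enumerate(subjects)
    }


subjects = ["s1", "s2", "s3", "s4"]
conditions = ["mouse", "touchpad"]  # hypothetical levels of "device type"

between_plan = between_subjects(subjects, conditions)
within_plan = within_subjects(subjects, conditions)
print(between_plan)
print(within_plan)
```

With rotation, each condition appears first for half of the subjects, so a learning effect does not systematically favor one condition.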

Run Experiment

How to do this was already covered well in Evaluation with user, but…

You must go through the IRB review process!

Always run a pilot first!
All subjects should follow the same steps.

Choose the User Population

Pick a well-balanced sample

Novice, experts…
age… gender…

→ or maybe an independent model.

Control subject variability

→ Decide the number of subjects (power analysis), e.g., N = 20~30; this can be estimated from prior literature or from a small-scale study.
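
As a minimal sketch of a power analysis, the function below uses the standard normal approximation for a two-sided comparison of two group means. The effect size is Cohen's d, and the defaults (alpha = 0.05, power = 0.8) are conventional assumptions, not values from the lecture. The function name `subjects_per_group` is made up for this example.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects needed per group for a two-sided, two-sample test.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the significance level
    z_power = z(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_large = subjects_per_group(0.8)   # assumed large effect
n_medium = subjects_per_group(0.5)  # assumed medium effect
print(n_large, n_medium)
```

Under these assumptions a large effect needs about 25 subjects per group and a medium effect about 63. An exact t-based calculation tends to give a slightly larger number, so treat this as a lower bound.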

Run Statistical Analysis

Properties of our population

Mean, variance…

How different data sets relate to each other?

Probability that our claims are correct

Note: Significance level (level of statistical significance) → if p-value < alpha, the result is hard to attribute to chance. The result is statistically significant, so we reject the null hypothesis.
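
One way to see what a p-value measures is an exact permutation test, a nonparametric technique swapped in here purely for illustration. The task times below are made-up data, and `permutation_p_value` is a hypothetical helper. It enumerates every way of splitting the pooled data into two groups and counts how often the difference in means is at least as large as the observed one.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def abs_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_diff(sum(group_a))
    extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        splits += 1
        # Small tolerance guards against floating-point rounding.
        if abs_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / splits


ALPHA = 0.05                      # assumed significance level
times_a = [1.0, 2.0, 3.0, 4.0]    # made-up task times, condition A
times_b = [5.0, 6.0, 7.0, 8.0]    # made-up task times, condition B

p_value = permutation_p_value(times_a, times_b)
reject_null = p_value < ALPHA
print(round(p_value, 4), reject_null)
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so p = 2/70 (about 0.029), which is below alpha, and the null hypothesis is rejected.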

What test to use?

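Parametric test… → requires a few specific conditions (assumptions) to hold

Nonparametric test → statistically somewhat weaker (typically less power when the parametric assumptions do hold)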

Analysis of Variance (ANOVA)

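Did the independent variable have a significant effect on the dependent variable?

In other words, are the means significantly different?

→ But then why is it called ANO"VA"??

The means alone cannot tell us whether a difference is "significant". Only when the variance within the groups is sufficiently small can we tell whether two means differ "significantly".

F-value of the F-test: how much larger the between-group variance is than the within-group variance. A large value indicates a significant difference.

P-value: the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. It is not the probability that the null hypothesis is false or that the result is "not due to chance"; a small p-value only means the data would be unlikely if the null hypothesis held.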

s: significant

ns: not significant
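
The F-value described above can be computed by hand for a one-way ANOVA. The sketch below uses made-up measurements for three hypothetical conditions and does not compute the p-value, which would need the F distribution (for example from a statistics library).

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up data: same group means (2, 3, 6), different spread within groups.
tight = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
spread = [[0, 2, 4], [1, 3, 5], [4, 6, 8]]

f_tight = one_way_anova_f(tight)
f_spread = one_way_anova_f(spread)
print(f_tight, f_spread)
```

Both data sets have identical group means, yet the F-value drops from 13 to 3.25 when the within-group variance grows. That is why the variance, not just the means, decides whether a difference is significant.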

Quantitative Approach outcome

Focus on low-level effect

Captures patterns of use

Pros

Provide objective measurements
- Strong internal validity

Cons

Low external validity
Small differences may lack practical significance (even when they have high statistical significance)

Qualitative Approach outcome

Focus on high-level effect

Identifies user flow
Highlights ambiguities in task description
Reveals contextual insights