The Personality Lab

A system for building, configuring, and measuring distinct AI personas for deliberative research.

Overview

When Harvard's Berkman Klein Center needed a way to design distinct, measurable AI personas for its deliberative research platforms, no tool existed to do it. So I built one.

The Personality Lab is a custom tool and methodology for building, calibrating, and validating AI personas from the ground up. 

The process moves through three stages: an extensive persona template structures the initial build, a 5-axis behavioral scale calibrates the result, and a 15-prompt evaluation battery stress-tests and refines the final persona before deployment.


Fictional ensemble archetypes and real personality profiles serve as psychological anchors throughout, giving each persona a coherent, recognizable behavioral foundation that makes AI feel like someone, not something. Personas that pass through the full process can be deployed to power AI-mediated civic dialogue as part of ongoing deliberative research.

My Role

I conceived and built the Personality Lab from scratch as Senior UX Designer at Harvard’s Applied Social Media Lab, where I was the only person working on it.

 

This meant defining the entire persona framework, inventing the measurement methodology, and designing every instrument used to build, calibrate, and validate AI personalities. There was no blueprint. I created the system, then used it.

Tools and methods

 

  • Custom persona template (extensive build instrument)

  • 5-axis behavioral scale (7-point, psychometrically grounded)

  • 15-prompt standardized evaluation battery

  • Character cluster validation (fictional ensemble and real personality profiling)

  • 3-layer architecture (Sensors / Decision Layer / Voice Synthesizer)

  • Historiesis-based reasoning model

The Problem

Bots tend to have personality the way a font has personality: at the surface level. You can change the tone, but the underlying reasoning doesn’t change. I wanted to understand what it would actually take to make different AI configurations behave differently in testable, reproducible ways.


The distinction matters. Tone is aesthetic. Behavior is architectural. A bot that sounds warm but reasons the same way as a bot that sounds clinical isn’t a different persona—it’s the same persona wearing different clothes. What I needed was a system where personality lives in the decision layer, not just the voice.

Intellectual Lineage

I came across Anthropic’s Assistant Axis research—they were measuring controllable personality dimensions in LLMs. That gave me a structural vocabulary. But their axes were about stylistic adaptation. Mine needed to be about behavioral outcomes—specifically, what configurations produce psychological safety and engagement in live event contexts, without causing persona drift.


So I took their framework, mapped it onto what I was actually observing, added the dimensions that were missing—particularly safety versus efficiency, and how grounded versus exploratory a bot should be—and landed on five axes, each mapped to a measurable behavior in the system. These aren’t abstract personality traits. They’re architectural parameters.

The Five Axes

Each persona is configured on five behavioral axes, scored on a 7-point scale. The 7-point range isn’t arbitrary. It’s grounded in Miller’s Law, the cognitive psychology finding (Miller, 1956) that people can reliably distinguish roughly seven categories along a single dimension, and it mirrors the 7-point Likert-type scales widely used in personality measurement. Seven points gives you a true midpoint, three degrees of intensity on each side, and enough range to detect meaningful movement without introducing false precision.
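As a minimal sketch of what a five-axis, 7-point configuration could look like in code (an illustration, not the Lab's actual implementation): the class name `PersonaConfig`, the axis labels, and the example scores are all hypothetical. Only "safety versus efficiency" and "grounded versus exploratory" are named in this write-up; the other three labels are placeholders.

```python
from dataclasses import dataclass

SCALE_MIN, SCALE_MAX = 1, 7               # the 7-point scale
MIDPOINT = (SCALE_MIN + SCALE_MAX) // 2   # 4: the true midpoint
NUM_AXES = 5


@dataclass(frozen=True)
class PersonaConfig:
    """Hypothetical container for one persona's axis scores."""
    name: str
    scores: dict  # axis label -> integer score from 1 to 7

    def __post_init__(self):
        # Reject configurations that do not have exactly five axes.
        if len(self.scores) != NUM_AXES:
            raise ValueError(f"expected {NUM_AXES} axes, got {len(self.scores)}")
        # Reject scores that fall off the 7-point scale.
        for axis, value in self.scores.items():
            if not isinstance(value, int) or not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{axis!r} must be an integer from 1 to 7, got {value!r}")

    def intensity(self, axis):
        """Signed distance from the midpoint, from -3 to +3 (0 is neutral)."""
        return self.scores[axis] - MIDPOINT


example = PersonaConfig(
    name="example-persona",
    scores={
        "safety_vs_efficiency": 2,      # axis named in the text; score is made up
        "grounded_vs_exploratory": 6,   # axis named in the text; score is made up
        "placeholder_axis_3": 4,        # placeholder label
        "placeholder_axis_4": 5,        # placeholder label
        "placeholder_axis_5": 3,        # placeholder label
    },
)
print(example.intensity("safety_vs_efficiency"))  # -2: two steps below the midpoint
```

Storing scores as integers and deriving intensity from the midpoint keeps the "three degrees on each side" reading explicit: every axis resolves to a value between -3 and +3.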

The philosophical backbone of the framework is historiesis—the idea that a model’s prior history shapes how it reasons through new situations, the way judicial philosophy shapes how a judge interprets new cases. Two models with identical training but different configured histories will reason differently about the same problem. That’s not tone. That’s architecture.

The Validation Instrument

To test whether different axis configurations actually behave differently, I developed a 15-prompt standardized evaluation battery. The prompts aren’t random—they systematically cover six interaction types most likely to surface behavioral differentiation between personas:

 

  • Problem-solving requests

  • Emotional support moments

  • Conceptual explanation

  • Ethical and judgment calls

  • Self-doubt and discouragement

  • Process and decision questions


Running the same 15 prompts across different axis configurations produces comparable, reproducible behavioral data. This is what separates persona design from persona measurement.
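The idea of holding the prompts fixed while varying the configuration can be sketched as a simple grid run. This is an assumption-laden illustration: the prompt IDs, the way the 15 prompts are spread across the six types, the function names, and the stub responder are all hypothetical, and the real prompts are not reproduced here.

```python
from itertools import product

INTERACTION_TYPES = [
    "problem_solving",
    "emotional_support",
    "conceptual_explanation",
    "ethical_judgment",
    "self_doubt",
    "process_decision",
]

# Placeholder battery: 15 prompt IDs cycled over the six types.
# The actual prompts and their distribution are not given in the text.
BATTERY = [
    {"id": f"P{i + 1:02d}", "type": INTERACTION_TYPES[i % len(INTERACTION_TYPES)]}
    for i in range(15)
]


def run_battery(configs, respond):
    """Run every prompt against every configuration.

    `configs` maps a configuration name to its axis scores; `respond` is any
    callable (config, prompt) -> str, such as a wrapper around a model call.
    Returns rows that can be compared prompt by prompt across configurations.
    """
    rows = []
    for (name, config), prompt in product(configs.items(), BATTERY):
        rows.append({
            "config": name,
            "prompt_id": prompt["id"],
            "type": prompt["type"],
            "response": respond(config, prompt),
        })
    return rows


def stub_respond(config, prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return f"[{prompt['type']}] reply under {config}"


configs = {
    "config_a": {"safety_vs_efficiency": 2},
    "config_b": {"safety_vs_efficiency": 6},
}
rows = run_battery(configs, stub_respond)
print(len(rows))  # 2 configurations x 15 prompts = 30 rows
```

Because every configuration sees the identical prompt list, any two rows with the same `prompt_id` differ only in the configuration that produced them, which is what makes the outputs comparable.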

Character Clusters: Validation by Ground Truth

The first validation challenge is straightforward: how do you know the axes are doing real work? You need ground truth—characters whose personalities are so clearly differentiated that you can check your output against intuition. If the axis scores come out similar across clearly distinct characters, something is wrong with the model.


I developed character clusters—groups of known personalities mapped onto the five axes—with the theory that a well-chosen cluster will naturally weight against itself and find balance across the personality space. Each cluster anchors the system to characters with observable, legible behavioral differences.


I started with the ensemble from Real Genius. It’s a useful corpus because the characters represent genuinely distinct cognitive and social styles—and most people have a strong enough read on them to gut-check the results. From there, I stress-tested across very different domains:
 

  • Philosophers: Socrates, Buddha, Nietzsche, Marcus Aurelius, Emerson—testing the framework against historical figures with well-documented reasoning styles

  • Composers: Beethoven, Bach, Chopin, Debussy—transposing the axes from verbal to aesthetic disposition

  • Muppets: a deliberately playful cluster that tests whether the framework holds in contexts defined by warmth, chaos, and social role
     

If Socrates, Nietzsche, and Marcus Aurelius score meaningfully differently from one another on the same axes used for the Muppet characters, and both sets of scores feel intuitively right, that’s evidence the system is capturing something real about personality structure, not just reflecting the source material back.
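The ground-truth check described in this section (clearly distinct characters should not land on near-identical axis scores) could be sketched as a pairwise distance test. Everything specific here is an assumption for illustration: the function name, the Euclidean distance metric, the threshold of 2.0, and the placeholder characters and scores, none of which come from the actual clusters.

```python
from itertools import combinations
from math import dist


def undifferentiated_pairs(cluster, min_distance=2.0):
    """Return pairs of characters whose five-axis vectors are closer than `min_distance`.

    `cluster` maps a character name to a tuple of five scores on the 7-point scale.
    A non-empty result suggests the axes are not separating those characters.
    """
    flagged = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(cluster.items(), 2):
        if dist(vec_a, vec_b) < min_distance:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical scores for three placeholder characters.
cluster = {
    "character_a": (1, 6, 4, 2, 7),
    "character_b": (6, 2, 5, 6, 1),
    "character_c": (1, 6, 4, 3, 7),  # differs from character_a on one axis by one point
}
print(undifferentiated_pairs(cluster))  # [('character_a', 'character_c')]
```

A flagged pair is a prompt for a closer look, not a verdict: either the two characters really are similar, or the scoring is missing a dimension that distinguishes them.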

The Research Hypothesis

A colleague independently built a 27-tone sentiment analysis and engagement scoring model. When we mapped it against the 5-axis framework, the alignment was clean.


That convergence suggests a testable hypothesis: personality configuration isn’t just aesthetic—it’s predictive of engagement outcomes. Different bot configurations will produce measurably different conversation quality. The infrastructure to test that hypothesis—the measurement schema, the configuration parameters, the validation instrument—is what the Personality Lab provides.


That’s a research thread worth pulling.
