In the race to build ever-larger language models, one Stanford professor dared to ask an unfashionable question: how do we actually know these systems work? While billion-dollar labs competed for benchmark supremacy, Percy Liang quietly assembled the most comprehensive evaluation framework the field had ever seen. His creation, HELM — Holistic Evaluation of Language Models — became the standard by which the entire AI industry measured itself. But Liang’s influence extends far beyond benchmarks. As the founder of Together AI and a driving force behind open-source AI infrastructure, he has shaped a vision of artificial intelligence that is transparent, reproducible, and accountable. In an era defined by hype and secrecy, Percy Liang built the instruments of scientific rigor.
Early Life and Education
Percy Liang grew up with a deep fascination for mathematics and computer science. His academic path led him to the Massachusetts Institute of Technology, where he earned his bachelor’s degree. MIT’s culture of rigorous problem-solving and interdisciplinary thinking left a lasting impression on Liang, instilling in him the belief that the hardest problems in computing required both theoretical depth and practical engineering.
For his doctoral work, Liang moved to the University of California, Berkeley, where he pursued a PhD in computer science. At Berkeley, he worked at the intersection of machine learning and natural language processing, studying how statistical methods could be applied to understand and generate human language. His dissertation research focused on semantic parsing — the task of converting natural language utterances into formal meaning representations that machines could execute. This was not a fashionable topic at the time; deep learning was still years away from its breakthrough moment. But Liang recognized that the core challenge of AI was not raw pattern recognition but structured understanding of meaning.
His time at Berkeley also exposed him to the open-source ethos of the Bay Area’s research community. The idea that scientific progress depended on shared tools and reproducible results would become a cornerstone of his later work. After completing his PhD, Liang joined Stanford University’s Department of Computer Science, where he would spend the next decade building one of the most influential NLP research groups in the world.
Career and the Creation of HELM
At Stanford, Percy Liang established the Center for Research on Foundation Models (CRFM), a research initiative dedicated to studying the capabilities, limitations, and societal implications of large-scale AI models. By the early 2020s, the AI landscape had shifted dramatically. Models like GPT-3, PaLM, and Claude were demonstrating remarkable abilities, but the methods used to evaluate them were fragmented, inconsistent, and often misleading. A model might score well on one benchmark while failing catastrophically on another, and there was no standardized way to compare systems across the dimensions that actually mattered — accuracy, fairness, robustness, calibration, and efficiency.
This gap between capability and accountability drove Liang to create HELM: the Holistic Evaluation of Language Models. Released in 2022, HELM was not merely another leaderboard. It was a philosophical statement about what evaluation should mean in the age of foundation models.
Technical Innovation
HELM’s design reflected Liang’s insistence on completeness. Rather than testing models on a handful of tasks, HELM evaluated them across dozens of scenarios spanning question answering, summarization, information retrieval, sentiment analysis, toxicity detection, and more. Each scenario was assessed along multiple metrics — not just accuracy, but calibration (does the model know when it doesn’t know?), fairness (does performance degrade for underrepresented groups?), robustness (does it handle typos and adversarial inputs?), and efficiency (how much compute does it require?).
The framework introduced a taxonomy that organized evaluation into three layers: scenarios, metrics, and adaptation strategies. This allowed researchers to systematically compare how different models performed under different prompting strategies — zero-shot, few-shot, and chain-of-thought — providing a far richer picture than single-number benchmarks ever could.
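The adaptation strategies can be made concrete with a small sketch of how each one reshapes the prompt sent to the model for the same question. The templates below are illustrative only, not HELM's actual prompt formats:

```python
# Sketch: how each adaptation strategy reshapes the prompt sent to
# the model for the same question (templates are illustrative,
# not HELM's actual prompt formats).
def build_prompt(question, strategy, examples=()):
    if strategy == "zero_shot":
        return f"Q: {question}\nA:"
    if strategy == "few_shot":
        demos = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
        return f"{demos}Q: {question}\nA:"
    if strategy == "chain_of_thought":
        # Elicit intermediate reasoning before the final answer
        return f"Q: {question}\nA: Let's think step by step."
    raise ValueError(f"unknown strategy: {strategy}")

prompt = build_prompt(
    "What is 7 * 8?", "few_shot",
    examples=[("What is 2 + 2?", "4")],
)
```

Because the same question yields three different prompts, reporting results per strategy, as HELM does, reveals sensitivities that a single-number benchmark would hide.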
```python
# Simplified example: HELM-style multi-metric evaluation
# Demonstrates evaluating a model across multiple dimensions
class HolisticEvaluation:
    def __init__(self, model, scenarios):
        self.model = model
        self.scenarios = scenarios
        self.metrics = {
            "accuracy": self.compute_accuracy,
            "calibration": self.compute_calibration,
            "fairness": self.compute_fairness,
            "robustness": self.compute_robustness,
            "efficiency": self.compute_efficiency,
        }

    def evaluate_scenario(self, scenario, adaptation="zero_shot"):
        results = {}
        predictions = self.model.predict(
            scenario.inputs,
            strategy=adaptation,
        )
        for metric_name, metric_fn in self.metrics.items():
            results[metric_name] = metric_fn(
                predictions, scenario.references,
                demographic_groups=scenario.metadata,
            )
        return results

    def compute_calibration(self, preds, refs, **kwargs):
        """Expected Calibration Error (ECE):
        measures alignment between predicted
        confidence and actual correctness."""
        bins = self.bin_predictions(preds, refs, n_bins=10)
        ece = sum(
            len(b) / len(preds) *
            abs(b.avg_confidence - b.avg_accuracy)
            for b in bins
        )
        return {"ece": ece}

    def run_full_evaluation(self):
        report = {}
        for scenario in self.scenarios:
            for strategy in ["zero_shot", "few_shot", "chain_of_thought"]:
                key = f"{scenario.name}_{strategy}"
                report[key] = self.evaluate_scenario(scenario, strategy)
        return report

    # Remaining helpers (compute_accuracy, compute_fairness,
    # compute_robustness, compute_efficiency, bin_predictions)
    # are elided for brevity.
```
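The calibration metric in the class above can be worked through end to end on toy data. The following is a minimal self-contained version of ECE, not HELM's implementation:

```python
# Worked, self-contained version of the ECE calibration metric
# sketched in the class above (toy data, plain Python).
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted gap between average confidence and average
    accuracy within equal-width confidence bins."""
    total = len(confidences)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        in_bin = [
            (c, a) for c, a in zip(confidences, correct)
            if lo < c <= hi or (i == 0 and c == 0.0)
        ]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        avg_acc = sum(a for _, a in in_bin) / len(in_bin)
        ece += len(in_bin) / total * abs(avg_conf - avg_acc)
    return ece

# A model that is always 90% confident but right only half the
# time has an ECE of 0.4: badly calibrated despite being usable.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
```

The example shows why accuracy alone misleads: the toy model answers half the questions correctly, yet its confidence is systematically wrong, which is exactly the failure mode the calibration dimension surfaces.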
HELM was fully open-source and designed for reproducibility. Every evaluation run was logged, every prompt was recorded, and every result was publicly accessible. This transparency was revolutionary in an industry where most benchmarking happened behind closed doors and companies could cherry-pick their best numbers.
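That logging discipline can be sketched in a few lines: a hypothetical run record whose hash covers the exact prompt, completion, and scores, so any later change to an input or result is detectable. The field names here are invented for illustration, not HELM's actual schema:

```python
import hashlib
import json

def make_run_record(model_name, prompt, completion, metrics):
    """Package one evaluation instance as a reproducible record.

    The hash is computed over every field needed to reproduce the
    result, so a silent edit to the prompt or output changes it.
    """
    record = {
        "model": model_name,
        "prompt": prompt,
        "completion": completion,
        "metrics": metrics,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["reproducibility_hash"] = hashlib.sha256(payload).hexdigest()
    return record

record = make_run_record(
    "example/model-7b",
    "Q: What is 2 + 2?\nA:",
    " 4",
    {"exact_match": 1.0},
)
```

Sorting the keys before serializing makes the hash deterministic, so two independent runs over identical inputs and outputs produce identical records.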
Why It Mattered
Before HELM, model evaluation in NLP was a patchwork. Researchers would report results on SuperGLUE, MMLU, or HellaSwag, but these benchmarks each captured only a sliver of what a language model could or could not do. Companies used this fragmentation strategically, highlighting whichever benchmark made their model look best. HELM changed the incentive structure by making it impossible to hide weaknesses.
The framework was quickly adopted by leading AI labs, policymakers, and academic researchers. It influenced how organizations like the National Institute of Standards and Technology (NIST) thought about AI assessment standards. The European Union’s AI Act discussions referenced the need for exactly the kind of holistic evaluation that HELM provided. For the first time, there was a shared language for discussing model quality that went beyond raw performance.
Liang’s work on HELM also connected to the broader mission of researchers like Stuart Russell, who argued that AI systems must be designed with human values at their center. Where Russell provided the philosophical framework, Liang provided the measurement infrastructure.
Other Major Contributions
While HELM is Liang’s most widely known creation, his research portfolio spans decades and covers multiple foundational areas of NLP and AI.
Semantic Parsing and Natural Language Interfaces. Liang’s early work at Berkeley and Stanford focused on teaching machines to convert natural language into executable programs. His systems could take a question like “What is the population of the largest city in Europe?” and translate it into a database query. This line of research influenced virtually every voice assistant and natural language interface built in the following decade. The approach of grounding language in formal semantics — rather than treating it as a pure statistical problem — prefigured the structured reasoning capabilities that modern language models struggle to achieve reliably.
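The idea can be illustrated with a toy rule-based parser that maps one question pattern to a database query. The systems Liang built learned such mappings from data rather than hand-written rules; this sketch only shows the input/output relationship:

```python
import re

# Toy grammar: map one question template to an executable query.
# Real semantic parsers learn these mappings statistically; this
# single hand-written rule is for illustration only.
PATTERN = re.compile(
    r"what is the (?P<attr>population|area) of (?P<entity>[\w\s]+)\?",
    re.IGNORECASE,
)

def parse_to_sql(question):
    """Return a formal query for the utterance, or None when it
    falls outside the toy grammar."""
    m = PATTERN.match(question.strip())
    if m is None:
        return None
    attr = m.group("attr").lower()
    entity = m.group("entity").strip()
    return f"SELECT {attr} FROM cities WHERE name = '{entity}'"

query = parse_to_sql("What is the population of Paris?")
# query == "SELECT population FROM cities WHERE name = 'Paris'"
```

The key property, one that statistical semantic parsers preserve, is that the output is executable: the parse either answers the question against a database or fails explicitly, rather than producing fluent but ungrounded text.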
Foundation Models Report. In 2021, Liang co-led a landmark 200-page report titled “On the Opportunities and Risks of Foundation Models,” which assembled over 100 researchers to map the landscape of large-scale AI. The report coined the term “foundation models” to describe systems like GPT and BERT that serve as the base for countless downstream applications. It systematically analyzed the technical capabilities, societal implications, economic effects, and governance challenges of these systems. The report became one of the most cited papers in AI policy and shaped how governments, universities, and industry leaders thought about the future of artificial intelligence.
Together AI. Recognizing that open-source AI needed not just research but infrastructure, Liang co-founded Together AI, a company dedicated to making it practical and affordable to train, fine-tune, and deploy open-source models. Together AI built a decentralized computing platform that allowed researchers and companies to access GPU clusters without the massive capital expenditure typically required. The company’s work lowered the barrier to entry for AI development, enabling startups and academic labs to compete with tech giants. This effort paralleled the broader democratization movement championed by researchers like Yoshua Bengio, who had long advocated for open access to AI tools and knowledge.
Alpaca and Open-Source Language Models. Liang’s Stanford lab produced Alpaca, a fine-tuned language model based on Meta’s LLaMA, demonstrating that a high-quality instruction-following model could be created for under $600 in compute costs. This finding sent shockwaves through the industry, challenging the narrative that only billion-dollar companies could produce useful language models. Alpaca inspired a wave of open-source model development and contributed to the ecosystem that researchers like Ilya Sutskever and others at major labs had to increasingly engage with.
```yaml
# Example: HELM benchmark configuration
# Defining a holistic evaluation run across scenarios
evaluation:
  name: "helm-v1.0-full"
  models:
    - name: "openai/gpt-4"
      api_endpoint: "https://api.openai.com/v1/completions"
    - name: "meta/llama-2-70b"
      api_endpoint: "https://api.together.xyz/v1/completions"
    - name: "anthropic/claude-2"
      api_endpoint: "https://api.anthropic.com/v1/complete"
  scenarios:
    - name: "question_answering"
      datasets: ["natural_questions", "triviaqa", "quac"]
      metrics: [exact_match, f1, calibration_ece]
      adaptations: [zero_shot, five_shot]
    - name: "summarization"
      datasets: ["xsum", "cnn_dailymail"]
      metrics: [rouge_l, faithfulness, extractive_coverage]
      adaptations: [zero_shot]
    - name: "toxicity_detection"
      datasets: ["real_toxicity_prompts", "bold"]
      metrics: [toxicity_probability, demographic_parity]
      adaptations: [zero_shot]
    - name: "reasoning"
      datasets: ["gsm8k", "math", "bbh"]
      metrics: [exact_match, chain_of_thought_accuracy]
      adaptations: [zero_shot, chain_of_thought]
  fairness_analysis:
    demographic_splits: [gender, race, age_group]
    disparity_threshold: 0.05
  output:
    format: "json"
    publish_to: "https://crfm.stanford.edu/helm/results"
    include_all_prompts: true
    reproducibility_hash: true
```
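The fairness_analysis section of the configuration implies a simple check. A hedged sketch of what applying a disparity threshold might look like, with group names and scores invented for illustration:

```python
# Sketch of what applying a fairness disparity threshold might
# look like (group names and scores here are invented).
def disparity_report(group_scores, threshold=0.05):
    """Largest accuracy gap between any two demographic groups,
    flagged when it exceeds the configured threshold."""
    scores = list(group_scores.values())
    gap = max(scores) - min(scores)
    return {"max_gap": gap, "flagged": gap > threshold}

report = disparity_report({"group_a": 0.84, "group_b": 0.76})
# a gap of 0.08 exceeds the 0.05 threshold, so this run is flagged
```

Reporting the gap itself, rather than only a pass/fail verdict, fits the holistic philosophy: readers can see how far a model is from parity, not just whether it crossed a line.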
Philosophy and Approach
Percy Liang operates at the intersection of engineering ambition and scientific caution. In a field where bold claims often outpace evidence, his approach is methodical, principled, and deeply committed to transparency. His philosophy can be understood through several core beliefs that have guided his work for over a decade.
Key Principles
- Measurement before deployment. Liang believes that no AI system should be widely deployed without rigorous, multi-dimensional evaluation. This is not a conservative stance — it is a scientific one. Just as pharmaceutical companies must demonstrate safety and efficacy before releasing drugs, AI developers should demonstrate that their models meet clear standards across accuracy, fairness, robustness, and efficiency.
- Transparency as a non-negotiable. Throughout his career, Liang has insisted that AI research must be open and reproducible. This means publishing not just results but the exact prompts, evaluation pipelines, and data used to generate them. HELM embodies this principle: every evaluation is fully documented and publicly accessible. This stands in stark contrast to the growing trend of secrecy among AI labs, a concern also voiced by Daniela Amodei and others working on AI safety.
- Holistic thinking over reductionism. Liang rejects the idea that a single number can capture the quality of an AI system. His multi-metric approach insists that accuracy alone is insufficient — a model that is highly accurate but poorly calibrated, biased, or brittle is not a good model. This perspective has influenced how the entire industry thinks about evaluation, much as Demis Hassabis broadened the definition of what AI could achieve beyond narrow game-playing.
- Open-source as infrastructure for progress. Liang sees open-source AI not as a philosophical luxury but as a practical necessity. Without open models, open benchmarks, and open infrastructure, the field risks becoming an oligopoly where a handful of companies control the most powerful technology ever created. His work with Together AI is a direct expression of this belief.
- Bridging research and policy. Unlike many academics who remain in the ivory tower, Liang has actively engaged with policymakers, contributing to discussions about AI regulation, safety standards, and governance frameworks. His Foundation Models report became a key reference for legislators crafting AI policy in both the United States and Europe.
Legacy and Impact
Percy Liang’s contributions have reshaped how the AI community thinks about accountability. Before HELM, model evaluation was an afterthought — something researchers did to get papers published, not a first-class scientific discipline. Liang elevated evaluation to a central concern, arguing that without rigorous measurement, AI progress was an illusion. This message resonated deeply in an era of rapid capability growth and increasing public concern about AI safety.
His influence extends through multiple channels. At Stanford, his students and collaborators have gone on to lead research efforts at major AI labs and startups. The CRFM has become one of the most important centers for AI policy research in the world, producing work that directly influences how governments regulate AI systems. Together AI has grown into a significant force in the open-source AI ecosystem, providing the infrastructure that allows researchers without corporate backing to participate in the most important scientific endeavor of the century.
Liang’s emphasis on open science connects him to a lineage of tech pioneers who believed that the most powerful technologies should be shared, not hoarded. Just as Linus Torvalds made operating system development accessible to everyone through Linux and Git, Liang is working to ensure that AI development remains a collaborative, transparent enterprise. And like Andrew Tanenbaum, whose educational contributions shaped generations of computer scientists, Liang’s HELM framework is becoming a teaching tool that helps the next generation of AI researchers understand what rigorous evaluation looks like.
The long-term impact of Liang’s work may be measured not in the models he helped build but in the standards he helped establish. As AI systems become increasingly integrated into healthcare, education, finance, and governance, the question of how we evaluate them becomes existential. Percy Liang has provided the field with the tools and the intellectual framework to answer that question honestly.
Key Facts
- Full name: Percy Liang
- Education: BS from MIT; PhD in Computer Science from UC Berkeley
- Position: Associate Professor of Computer Science, Stanford University
- Founded: Stanford Center for Research on Foundation Models (CRFM)
- Co-founded: Together AI — open-source AI infrastructure company
- Created: HELM (Holistic Evaluation of Language Models)
- Notable project: Stanford Alpaca — instruction-following model trained for under $600
- Key publication: “On the Opportunities and Risks of Foundation Models” (2021) — co-led with 100+ researchers
- Research focus: NLP, semantic parsing, AI evaluation, foundation models, AI safety
- Awards: Multiple best paper awards at ACL, EMNLP, and NeurIPS
FAQ
What is HELM and why is it important for AI evaluation?
HELM, or Holistic Evaluation of Language Models, is a comprehensive benchmark framework created by Percy Liang and his team at Stanford. Unlike traditional benchmarks that test models on a single task or metric, HELM evaluates language models across dozens of scenarios and multiple dimensions including accuracy, calibration, fairness, robustness, and efficiency. Its importance lies in providing a standardized, transparent, and reproducible way to compare AI models, preventing companies from cherry-picking favorable benchmarks and giving researchers, policymakers, and the public an honest picture of what these systems can and cannot do.
How did Stanford Alpaca change the open-source AI landscape?
Stanford Alpaca, produced by Percy Liang’s lab, demonstrated that a high-quality instruction-following language model could be created by fine-tuning an existing open-source model for under $600 in compute costs. This finding was transformative because it challenged the prevailing assumption that only companies with billions of dollars in resources could produce useful AI models. Alpaca inspired a wave of open-source model development, lowering the barrier to entry for academic researchers, startups, and independent developers who wanted to build and experiment with language models without massive corporate backing.
What is the Stanford Center for Research on Foundation Models?
The Center for Research on Foundation Models (CRFM) is a research initiative at Stanford University founded by Percy Liang. It brings together researchers from computer science, linguistics, philosophy, law, and policy to study large-scale AI models — called foundation models — that serve as the basis for a wide range of applications. CRFM’s work spans technical research (evaluation, safety, capabilities) and societal analysis (economic impact, governance, ethics). The center has produced influential reports and tools, including HELM, that have shaped how the AI industry and governments think about the development and regulation of advanced AI systems.
How does Percy Liang’s work connect AI research to public policy?
Percy Liang is one of the most prominent AI researchers who actively bridges the gap between technical research and public policy. His Foundation Models report, co-authored with over 100 researchers, provided a comprehensive analysis of the opportunities and risks of large-scale AI that became a key reference for policymakers worldwide. HELM’s transparent evaluation framework gave regulators a concrete tool for assessing AI systems. Liang has participated in policy discussions at the national and international level, advocating for evaluation standards, transparency requirements, and governance frameworks that ensure AI development serves the public interest rather than just commercial objectives.