
AI Model Parameter Counts: A Comprehensive Analysis

The landscape of large language models has evolved dramatically since 2018, with parameter counts ranging from millions to (reportedly) trillions. A clear pattern emerges in how companies approach model transparency: early models typically shipped with officially confirmed parameter counts, while newer models increasingly keep this information proprietary.

Key Findings and Trends

Transparency shifts across the industry. Companies like OpenAI, Google, and Anthropic have moved from openly sharing parameter counts in their early models to keeping this information confidential for competitive reasons. OpenAI disclosed full details for GPT-1 through GPT-3 but has remained silent on GPT-4 and newer models. Similarly, Google openly shared PaLM's 540B parameters but keeps Gemini specifications under wraps.

The rise of efficient architectures. Many companies now use Mixture of Experts (MoE) architectures that activate only a fraction of total parameters per token. DeepSeek-V3, for instance, has 671B total parameters but activates only 37B per token, so inference compute scales with the active parameter count rather than the total. This lets models grow in capacity without a proportional increase in inference cost.
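
A rough sketch of the active-parameter arithmetic behind such figures. Every number below (the shared/routed split, expert count, and top-k) is an illustrative assumption, not DeepSeek's published configuration:

```python
# Per-token active parameters in a top-k MoE: always-on parameters plus
# the fraction of routed-expert parameters that top-k routing selects.
# All numbers here are illustrative assumptions, not DeepSeek's config.

def active_params_b(shared_b: float, routed_b: float,
                    num_experts: int, top_k: int) -> float:
    """shared_b: attention/embeddings/always-on experts (billions);
    routed_b: total routed-expert parameters (billions)."""
    return shared_b + routed_b * top_k / num_experts

# Hypothetical split of a 671B-total model: 17B always-on, 654B routed
# across 256 experts with top-8 routing (split chosen so the result
# lands near DeepSeek-V3's reported 37B active).
print(f"~{active_params_b(17, 654, 256, 8):.0f}B active per token")  # ~37B
```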

Parameter count isn't everything. Google's PaLM 2 demonstrates this principle, reportedly outperforming its 540B-parameter predecessor with roughly 340B parameters. Companies increasingly focus on training efficiency, data quality, and architectural innovation rather than raw parameter count.
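
One formalization of this trade-off is the compute-optimal ("Chinchilla") result of Hoffmann et al. (2022): for a fixed compute budget, a smaller model trained on more tokens can beat a larger, under-trained one. A minimal sketch using the common cost model C ≈ 6ND and the rough ~20 tokens-per-parameter heuristic (both are approximations, not exact values):

```python
# Chinchilla-style compute-optimal sizing under two rough approximations:
# training compute C ~ 6*N*D FLOPs, and D_opt ~ 20*N_opt tokens.

def optimal_size(compute_flops: float) -> tuple[float, float]:
    """Return (params N, training tokens D) that spend the budget
    compute-optimally under the heuristics above."""
    n = (compute_flops / (6 * 20)) ** 0.5
    return n, 20 * n

n, d = optimal_size(1e24)  # an arbitrary example budget, in FLOPs
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.1f}T tokens")  # ~91B on ~1.8T
```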

Comprehensive Parameter Count Table

Confirmed Parameter Counts (Official and Well-Sourced Reports)

| Company | Model | Parameters | Release Date | Architecture | Source |
|---|---|---|---|---|---|
| OpenAI | GPT-1 | 117M | June 2018 | Transformer | Official paper |
| | GPT-2 Small | 124M | Feb 2019 | Transformer | GitHub repo |
| | GPT-2 Medium | 355M | Feb 2019 | Transformer | GitHub repo |
| | GPT-2 Large | 774M | Feb 2019 | Transformer | GitHub repo |
| | GPT-2 XL | 1.5B | Feb 2019 | Transformer | GitHub repo |
| | GPT-3 Ada | ~350M | May 2020 | Transformer | EleutherAI analysis |
| | GPT-3 Babbage | ~1.3B | May 2020 | Transformer | EleutherAI analysis |
| | GPT-3 Curie | ~6.7B | May 2020 | Transformer | EleutherAI analysis |
| | GPT-3 Davinci | 175B | May 2020 | Transformer | Official paper |
| Meta | Llama 1 (7B) | 7B | Feb 2023 | Transformer | Official release |
| | Llama 1 (13B) | 13B | Feb 2023 | Transformer | Official release |
| | Llama 1 (30B) | 30B | Feb 2023 | Transformer | Official release |
| | Llama 1 (65B) | 65B | Feb 2023 | Transformer | Official release |
| | Llama 2 (7B) | 7B | July 2023 | Transformer | Official release |
| | Llama 2 (13B) | 13B | July 2023 | Transformer | Official release |
| | Llama 2 (70B) | 70B | July 2023 | Transformer | Official release |
| | Code Llama (all sizes) | 7B, 13B, 34B, 70B | Aug 2023-Jan 2024 | Transformer | Official release |
| | Llama 3 | 8B, 70B | April 2024 | Transformer | Official release |
| | Llama 3.1 | 8B, 70B, 405B | July 2024 | Transformer | Official release |
| | Llama 3.2 | 1B, 3B, 11B, 90B | Sept 2024 | Transformer/Vision | Official release |
| | Llama 3.3 | 70B | Dec 2024 | Transformer | Official release |
| | Llama 4 Scout | 109B total, 17B active | April 2025 | MoE | Official release |
| | Llama 4 Maverick | 400B total, 17B active | April 2025 | MoE | Official release |
| Google | T5 (all sizes) | 77M to 11B | 2019 | Transformer | Official paper |
| | LaMDA | 137B | 2021 | Transformer | Official blog |
| | PaLM | 540B | April 2022 | Transformer | Official blog |
| | PaLM 2 | 340B | May 2023 | Transformer | CNBC report |
| | Gemini Nano | 1.8B, 3.25B | Dec 2023 | Transformer | Technical docs |
| DeepSeek | DeepSeek-LLM | 7B, 67B | Dec 2023 | Transformer | GitHub repo |
| | DeepSeek-Coder | 1.3B, 6.7B, 33B | Nov 2023 | Transformer | arXiv paper |
| | DeepSeek-V2 | 236B total, 21B active | May 2024 | MoE | arXiv paper |
| | DeepSeek-V3 | 671B total, 37B active | Dec 2024 | MoE | arXiv paper |
| | DeepSeek-R1 | 671B total, 37B active | Jan 2025 | MoE | GitHub repo |
| xAI | Grok-1 | 314B | Nov 2023 | MoE (8 experts) | GitHub repo |
| Mistral | Mistral 7B | 7.3B | Oct 2023 | Transformer | Official blog |
| | Mixtral 8x7B | 46.7B total, 12.9B active | Jan 2024 | MoE | Official blog |
| | Mixtral 8x22B | 141B total, 39B active | April 2024 | MoE | Official blog |
| | Codestral | 22B | May 2024 | Transformer | Official docs |
| Others | AI21 Jurassic-1 | 7B, 178B | Aug 2021 | Transformer | Official blog |
| | Inflection-2 | 175B | Nov 2023 | Transformer | Company announcement |
| | Cohere Command R+ | 104B | March 2024 | Transformer | HuggingFace |
| | Alibaba Qwen series | 0.5B to 235B | 2023-2024 | Transformer/MoE | GitHub repos |
| | 01.AI Yi series | 6B, 9B, 34B | Nov 2023 | Transformer | HuggingFace |
| | Stability StableLM | 1.6B to 7B | 2023-2024 | Transformer | GitHub repos |
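
For the dense models above, reported counts can be roughly sanity-checked from published architecture details: transformer blocks contribute about 12·L·d² parameters (L layers, model width d), ignoring embeddings and biases. A quick check against two well-documented configurations:

```python
# Rough sanity check: dense transformer parameters ~ 12 * layers * d_model^2
# (attention + MLP blocks; embeddings and biases ignored).

def approx_params(layers: int, d_model: int) -> float:
    return 12 * layers * d_model ** 2

# Published configurations for two models in the table above:
print(f"GPT-2 XL:      ~{approx_params(48, 1600) / 1e9:.2f}B (reported 1.5B)")
print(f"GPT-3 Davinci: ~{approx_params(96, 12288) / 1e9:.0f}B (reported 175B)")
```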

Estimated Parameter Counts (Unconfirmed)

| Company | Model | Estimated Parameters | Release Date | Source of Estimate |
|---|---|---|---|---|
| Anthropic | Claude 3 Haiku | ~20B | March 2024 | Alan D. Thompson (AI researcher) |
| | Claude 3 Sonnet | ~70B | March 2024 | Alan D. Thompson |
| | Claude 3 Opus | ~2T | March 2024 | Alan D. Thompson |
| | Claude 3.5 Sonnet | >175B | June 2024 | Third-party analysis |
| OpenAI | GPT-3.5-Turbo | ~20B | Nov 2022 | Speculation based on performance |
| | GPT-4 | ~1.7T total | March 2023 | Industry estimates |
| | GPT-4o | ~200B | May 2024 | Third-party analysis |
| | GPT-4o Mini | ~8B | May 2024 | Third-party analysis |
| | o1-preview | ~300B | Sept 2024 | Third-party estimates |
| | o1-mini | ~100B | Sept 2024 | Third-party estimates |
| Google | Gemini Ultra | 30-65T (speculated) | Dec 2023 | Unconfirmed rumors |

Models with Undisclosed Parameters

The following models have no reliable parameter count information available:

  • Anthropic: All Claude 1, 2, 3.5 v2, 3.7, and 4 models
  • OpenAI: GPT-4 Turbo, o3, o3-mini
  • Google: All Gemini models except Nano
  • xAI: All Grok models except Grok-1
  • Mistral: Mistral Large, Medium, Small
  • Others: Inflection-1, Cohere Command R, AI21 Jurassic-2

Notable Architectural Innovations

Mixture of Experts (MoE) Revolution

DeepSeek did not invent MoE, but it has pushed cost-efficiency furthest among open models: DeepSeek-V3 reportedly approaches GPT-4-level performance at a fraction (by some estimates 1/20th) of the training cost. Its 671B-parameter model activates only 37B per token, demonstrating how architectural choices can dramatically improve efficiency.
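
A minimal top-k routing layer makes the mechanism concrete. This is a generic sketch, not DeepSeek's implementation: production MoE stacks add shared experts, load-balancing losses, and heavily optimized expert dispatch.

```python
# Minimal top-k mixture-of-experts layer (generic sketch, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert per token,
        # but only the top_k experts actually run.
        weights = F.softmax(self.router(x), dim=-1)    # (T, E)
        topw, topi = weights.topk(self.top_k, dim=-1)  # (T, k)
        topw = topw / topw.sum(dim=-1, keepdim=True)   # renormalize gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = topi == e                           # (T, k) selection mask
            if mask.any():
                tok = mask.any(dim=-1)                 # tokens routed to e
                gate = (topw * mask).sum(dim=-1)[tok]  # their gate weights
                out[tok] += gate.unsqueeze(-1) * expert(x[tok])
        return out

# Per-token compute scales with top_k, not with the total expert count.
layer = TopKMoE(d_model=64, d_ff=256, num_experts=8, top_k=2)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```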

Context Length Evolution

While early models like GPT-2 supported 1,024 tokens, modern models reach unprecedented scales: Llama 4 Scout advertises a 10-million-token context window, DeepSeek models support 128K tokens, and Gemini models reach one to two million. This expansion enables entirely new use cases, from whole-codebase analysis to long-document reasoning.
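
Long contexts are expensive mainly through the key/value cache, which grows linearly with sequence length. A back-of-the-envelope sketch; the 48-layer, 8-KV-head, 128-dim configuration below is an illustrative assumption, not any specific model:

```python
# Back-of-the-envelope KV-cache size: 2 (K and V) * layers * kv_heads
# * head_dim * seq_len * bytes per value. Dimensions are illustrative.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# A hypothetical 48-layer model with 8 KV heads of dim 128 (GQA), fp16:
for tokens in (128_000, 10_000_000):
    print(f"{tokens:>12,} tokens -> {kv_cache_gb(48, 8, 128, tokens):,.0f} GB")
# ->      128,000 tokens -> 25 GB
# ->   10,000,000 tokens -> 1,966 GB
```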

Multimodal Integration

The latest generation includes native multimodal capabilities. Llama 3.2 introduced vision models with 11B and 90B parameters, while Llama 4 builds multimodality into its core architecture. This trend reflects the industry's move beyond text-only models.

Cost and Efficiency Insights

Training costs vary dramatically. DeepSeek-V3's 671B-parameter model cost approximately $5.6 million to train, based on 2.788M H800 GPU hours, while estimates put GPT-4's training cost above $100 million. xAI's Grok-3 reportedly trained on a cluster of around 200,000 GPUs, a compute investment estimated at $6-8 billion, highlighting the resource intensity of frontier model development.
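
The DeepSeek-V3 figure is easy to reproduce from the reported GPU hours at an assumed rental rate; the $2/GPU-hour price is the assumption commonly cited from their technical report, not a measured cost:

```python
# Reproducing the ~$5.6M figure: reported GPU hours times an assumed rate.
gpu_hours = 2.788e6  # H800 GPU hours reported for DeepSeek-V3
rate_usd = 2.0       # assumed $/GPU-hour (rental-style estimate)
print(f"${gpu_hours * rate_usd / 1e6:.1f}M")  # -> $5.6M
```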

Industry Implications

The parameter count data reveals three critical industry trends. First, strategic opacity has become the norm, with leading companies treating model specifications as trade secrets. Second, efficiency over scale drives innovation, as companies achieve better performance with fewer parameters through architectural improvements. Third, open-source momentum continues, with Meta's Llama series and DeepSeek's models providing transparent alternatives to proprietary systems.

This analysis suggests that while parameter counts provide useful benchmarks, they represent just one dimension of model capability. The future of AI development lies not in raw parameter scaling but in architectural innovation, training efficiency, and multimodal integration.
