The landscape of large language models has evolved dramatically since 2018, with parameter counts ranging from tens of millions to, by some estimates, trillions. The data compiled here shows a clear pattern in model transparency: early models typically shipped with officially confirmed parameter counts, while newer models increasingly keep this information proprietary.
Transparency shifts across the industry. Companies like OpenAI, Google, and Anthropic have moved from openly sharing parameter counts in their early models to keeping this information confidential for competitive reasons. OpenAI disclosed full details for GPT-1 through GPT-3 but has remained silent on GPT-4 and newer models. Similarly, Google openly shared PaLM's 540B parameters but keeps Gemini specifications under wraps.
The rise of efficient architectures. Many companies now use Mixture of Experts (MoE) architectures that activate only a fraction of total parameters per token. DeepSeek-V3, for instance, has 671B total parameters but uses only 37B per token, achieving remarkable efficiency. This architectural innovation allows models to scale without proportionally increasing inference costs.
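To make the total-versus-active distinction concrete, the sketch below shows a minimal top-k expert-routing layer in PyTorch. It is purely illustrative: the dimensions, expert count, and `top_k` value are placeholder choices, not the configuration of DeepSeek-V3 or any other model mentioned here.

```python
# Illustrative Mixture-of-Experts layer with top-k routing.
# All sizes below are arbitrary placeholders, not any real model's configuration.
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                          # (num_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # each token picks top_k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out


layer = TopKMoE()
y = layer(torch.randn(4, 512))  # each of the 4 tokens passes through only 2 of the 8 experts

total = sum(p.numel() for p in layer.parameters())
active = sum(p.numel() for p in layer.router.parameters()) + \
    layer.top_k * sum(p.numel() for p in layer.experts[0].parameters())
print(f"total parameters: {total:,}, active per token: {active:,}")
```

Every expert's weights must be stored, but each token only pays the compute cost of the experts it is routed to, which is why total and active parameter counts diverge so sharply in the table below.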
Parameter count isn't everything. Google's PaLM 2 demonstrates this principle, achieving better performance than its 540B-parameter predecessor with only 340B parameters. Companies increasingly focus on training efficiency, data quality, and architectural innovations rather than raw parameter count.
| Company | Model | Parameters | Release Date | Architecture | Source |
|---|---|---|---|---|---|
| OpenAI | GPT-1 | 117M | June 2018 | Transformer | Official paper |
| | GPT-2 Small | 124M | Feb 2019 | Transformer | GitHub repo |
| | GPT-2 Medium | 355M | Feb 2019 | Transformer | GitHub repo |
| | GPT-2 Large | 774M | Feb 2019 | Transformer | GitHub repo |
| | GPT-2 XL | 1.5B | Feb 2019 | Transformer | GitHub repo |
| | GPT-3 Ada | ~350M | May 2020 | Transformer | EleutherAI analysis |
| | GPT-3 Babbage | ~1.3B | May 2020 | Transformer | EleutherAI analysis |
| | GPT-3 Curie | ~6.7B | May 2020 | Transformer | EleutherAI analysis |
| | GPT-3 Davinci | 175B | May 2020 | Transformer | Official paper |
| Meta | Llama 1 (7B) | 7B | Feb 2023 | Transformer | Official release |
| | Llama 1 (13B) | 13B | Feb 2023 | Transformer | Official release |
| | Llama 1 (30B) | 30B | Feb 2023 | Transformer | Official release |
| | Llama 1 (65B) | 65B | Feb 2023 | Transformer | Official release |
| | Llama 2 (7B) | 7B | July 2023 | Transformer | Official release |
| | Llama 2 (13B) | 13B | July 2023 | Transformer | Official release |
| | Llama 2 (70B) | 70B | July 2023 | Transformer | Official release |
| | Code Llama (all sizes) | 7B, 13B, 34B, 70B | Aug 2023-Jan 2024 | Transformer | Official release |
| | Llama 3 (8B, 70B) | 8B, 70B | April 2024 | Transformer | Official release |
| | Llama 3.1 (8B, 70B, 405B) | 8B, 70B, 405B | July 2024 | Transformer | Official release |
| | Llama 3.2 | 1B, 3B, 11B, 90B | Sept 2024 | Transformer/Vision | Official release |
| | Llama 3.3 | 70B | Dec 2024 | Transformer | Official release |
| | Llama 4 Scout | 109B total, 17B active | April 2025 | MoE | Official release |
| | Llama 4 Maverick | 400B total, 17B active | April 2025 | MoE | Official release |
| Google | T5 (all sizes) | 77M to 11B | 2019 | Transformer | Official paper |
| | LaMDA | 137B | 2021 | Transformer | Official blog |
| | PaLM | 540B | April 2022 | Transformer | Official blog |
| | PaLM 2 | 340B | May 2023 | Transformer | CNBC report |
| | Gemini Nano | 1.8B, 3.25B | Dec 2023 | Transformer | Technical docs |
| DeepSeek | DeepSeek-LLM | 7B, 67B | Dec 2023 | Transformer | GitHub repo |
| | DeepSeek-Coder | 1.3B, 6.7B, 33B | Nov 2023 | Transformer | arXiv paper |
| | DeepSeek-V2 | 236B total, 21B active | May 2024 | MoE | arXiv paper |
| | DeepSeek-V3 | 671B total, 37B active | Dec 2024 | MoE | arXiv paper |
| | DeepSeek-R1 | 671B total, 37B active | Jan 2025 | MoE | GitHub repo |
| xAI | Grok-1 | 314B | Nov 2023 | MoE (8 experts) | GitHub repo |
| Mistral | Mistral 7B | 7.3B | Sept 2023 | Transformer | Official blog |
| | Mixtral 8x7B | 46.7B total, 12.9B active | Dec 2023 | MoE | Official blog |
| | Mixtral 8x22B | 141B total, 39B active | April 2024 | MoE | Official blog |
| | Codestral | 22B | May 2024 | Transformer | Official docs |
| Others | AI21 Jurassic-1 | 7B, 178B | Aug 2021 | Transformer | Official blog |
| | Inflection-2 | 175B | Nov 2023 | Transformer | Company announcement |
| | Cohere Command R+ | 104B | April 2024 | Transformer | HuggingFace |
| | Alibaba Qwen series | 0.5B to 235B | 2023-2024 | Transformer/MoE | GitHub repos |
| | 01.AI Yi series | 6B, 9B, 34B | Nov 2023 | Transformer | HuggingFace |
| | Stability StableLM | 1.6B to 7B | 2023-2024 | Transformer | GitHub repos |
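For the open-weight models in the table above, the reported figures can be checked directly against the released checkpoints. A minimal sketch using the Hugging Face transformers library (assuming it is installed; GPT-2 Small is used here simply because it is quick to download):

```python
# Count the parameters of an open-weight checkpoint to verify a reported figure.
# Requires: pip install torch transformers
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # GPT-2 Small
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # roughly 124M, matching the table above
```

The next table collects third-party estimates for models whose parameter counts have never been officially disclosed; these figures should be read as informed guesses rather than confirmed specifications.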
| Company | Model | Estimated Parameters | Release Date | Source of Estimate |
|---|---|---|---|---|
| Anthropic | Claude 3 Haiku | ~20B | March 2024 | Alan D. Thompson (AI researcher) |
| | Claude 3 Sonnet | ~70B | March 2024 | Alan D. Thompson |
| | Claude 3 Opus | ~2T | March 2024 | Alan D. Thompson |
| | Claude 3.5 Sonnet | >175B | June 2024 | Third-party analysis |
| OpenAI | GPT-3.5-Turbo | ~20B | Nov 2022 | Speculation based on performance |
| | GPT-4 | ~1.7T total | March 2023 | Industry estimates |
| | GPT-4o | ~200B | May 2024 | Third-party analysis |
| | GPT-4o Mini | ~8B | July 2024 | Third-party analysis |
| | o1-preview | ~300B | Sept 2024 | Third-party estimates |
| | o1-mini | ~100B | Sept 2024 | Third-party estimates |
| Google | Gemini Ultra | 30-65T (speculated) | Dec 2023 | Unconfirmed rumors |
Beyond these estimates, several of the newest frontier models have no reliable public parameter-count information at all.
DeepSeek pioneered cost-efficient MoE architectures, with DeepSeek-V3 reported to reach GPT-4-level performance at roughly one-twentieth of GPT-4's estimated training cost. Its 671B-parameter model activates only 37B parameters per token, demonstrating how architectural innovation can dramatically improve efficiency.
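A back-of-the-envelope check on the active fraction and per-token compute, using the common ~2 × N FLOPs-per-parameter-per-token approximation for a forward pass (an approximation chosen here for illustration, not DeepSeek's published accounting):

```python
# Back-of-the-envelope: DeepSeek-V3 active fraction and forward-pass FLOPs per token.
# The 2 * N FLOPs/token rule of thumb is an approximation, not an official figure.
total_params = 671e9
active_params = 37e9

print(f"active fraction: {active_params / total_params:.1%}")      # ~5.5%
print(f"MoE forward FLOPs/token:      {2 * active_params:.2e}")    # ~7.4e10
print(f"dense-equivalent FLOPs/token: {2 * total_params:.2e}")     # ~1.3e12
```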
While early models like GPT-2 supported a context window of only 1,024 tokens, modern models reach unprecedented context lengths. Llama 4 Scout supports a 10-million-token context window, while DeepSeek and Gemini models commonly support 128K tokens or more. This expansion enables entirely new use cases, such as reasoning over whole codebases or book-length documents in a single prompt.
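To give a sense of what such context lengths imply for serving, here is a rough KV-cache memory estimate. The layer count, head configuration, and precision below are generic placeholder values for a large transformer, not any specific model's published architecture:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens.
# All architecture numbers are illustrative placeholders.
def kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

for tokens in (1_024, 128_000, 10_000_000):
    print(f"{tokens:>10,} tokens -> ~{kv_cache_bytes(tokens) / 2**30:,.1f} GiB of KV cache")
```

Under these placeholder assumptions, a 10-million-token context implies a KV cache in the terabyte range, which is why long-context serving leans heavily on attention and cache optimizations.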
The latest generation includes native multimodal capabilities. Llama 3.2 introduced vision models with 11B and 90B parameters, while Llama 4 builds multimodality into its core architecture. This trend reflects the industry's move beyond text-only models.
Training costs vary dramatically. DeepSeek-V3's 671B parameter model cost approximately $5.6 million to train using 2.788M H800 GPU hours, while estimates suggest GPT-4 cost over $100 million. xAI's Grok-3 reportedly used 200,000 GPUs with an estimated $6-8 billion training cost, highlighting the resource intensity of frontier model development.
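The DeepSeek figures are internally consistent, which is easy to sanity-check; the per-GPU-hour rate below is simply derived from the two reported numbers, not an independently sourced rental price:

```python
# Sanity check: reported H800 GPU-hours vs. reported training cost for DeepSeek-V3.
gpu_hours = 2.788e6      # reported H800 GPU-hours
reported_cost = 5.6e6    # reported training cost, USD

rate = reported_cost / gpu_hours
print(f"implied rate: ${rate:.2f} per GPU-hour")                       # about $2/hour
print(f"2.788M GPU-hours x ${rate:.2f}/h = ${gpu_hours * rate / 1e6:.1f}M")
```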
The parameter count data reveals three critical industry trends. First, strategic opacity has become the norm, with leading companies treating model specifications as trade secrets. Second, efficiency over scale drives innovation, as companies achieve better performance with fewer parameters through architectural improvements. Third, open-source momentum continues, with Meta's Llama series and DeepSeek's models providing transparent alternatives to proprietary systems.
This comprehensive analysis demonstrates that while parameter counts provide useful benchmarks, they represent just one dimension of model capability. The future of AI development lies not in raw parameter scaling but in architectural innovation, training efficiency, and multimodal integration.