😳 Some interesting leaks about OpenAI's GPT-4, the LLM behind ChatGPT
🔸 GPT-4, the latest model from OpenAI, boasts an impressive ~1.8 trillion parameters, a substantial increase over its predecessor, GPT-3. It uses a Mixture of Experts (MoE) architecture with 16 experts and a relatively simple routing scheme, so each forward pass activates only ~280B parameters rather than the full ~1.8 trillion a dense model would use.
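A quick back-of-envelope check of the active-parameter figure. The shared/expert split and top-2 routing are assumptions for illustration, not leaked numbers:

```python
# Back-of-envelope: active parameters per forward pass in a top-2 MoE.
# The shared/expert split below is an assumption for illustration, not a leaked figure.
total_params = 1.8e12        # ~1.8T total parameters (reported)
n_experts = 16               # experts (reported)
experts_per_token = 2        # assumed top-2 routing

shared_params = 0.06e12      # assumed: attention, embeddings, router (non-expert weights)
expert_params = (total_params - shared_params) / n_experts  # per-expert weights, all layers

active = shared_params + experts_per_token * expert_params
print(f"~{active / 1e12:.2f}T parameters active per forward pass")  # ≈ 0.28T, i.e. ~280B
```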
🔸 Training involved ~13T tokens, with two epochs for text and four for code. The instruction fine-tuning data came from both ScaleAI and internal sources. Pre-training used an 8k sequence length (seqlen); the 32k-seqlen version is a fine-tune of the 8k model.
🔸 During training, the batch size was ramped up incrementally to ~60 million tokens; the batch size in sequences is obtained by dividing that token count by the sequence length. The model used 8-way tensor parallelism across A100 GPUs and, beyond that, 15-way pipeline parallelism.
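Converting that token batch size into sequences, assuming "8k" means 8,192 tokens:

```python
# Sequences per batch implied by a ~60M-token batch at 8k sequence length.
tokens_per_batch = 60_000_000   # reported batch size in tokens
seq_len = 8_192                 # assuming "8k" = 8,192 tokens

print(f"~{tokens_per_batch / seq_len:,.0f} sequences per batch")  # ≈ 7,324
```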
🔸 The training compute was estimated at ~2.15e25 FLOPs, using ~25,000 A100s for 90 to 100 days at a cost of roughly $63 million. Today, the same pre-training could be done with ~8,192 H100s in ~55 days for about $21.5 million.
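These figures hang together under the standard ~6·N·D FLOPs rule of thumb (using active rather than total parameters) plus an implied per-GPU-hour price; the rule of thumb and the implied price are assumptions, the rest are the reported numbers:

```python
# Sanity check: training FLOPs via the ~6 * N * D rule of thumb (N = active params).
active_params = 2.8e11       # ~280B active parameters per token (reported)
train_tokens = 13e12         # ~13T training tokens (reported)
print(f"{6 * active_params * train_tokens:.2e} FLOPs")  # ≈ 2.2e25, near the reported ~2.15e25

# Implied A100 rental price from the reported cluster size, duration, and cost.
gpus, days, cost_usd = 25_000, 95, 63e6   # midpoint of the 90-100 day range
gpu_hours = gpus * days * 24
print(f"~${cost_usd / gpu_hours:.2f} per A100-hour")    # ≈ $1.1 per A100-hour (implied)
```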
🔸 The MoE approach comes with trade-offs and can be tricky to manage at inference time. While more experts could in principle improve the model, it becomes harder to get them to generalize across tasks and to converge, which is why OpenAI settled on 16 experts.
🔸 The inference cost of GPT-4 is about three times that of the 175B-parameter Davinci, mainly due to the larger clusters required and lower utilization. The estimated cost is $0.0049 per 1k tokens on 128 A100s and $0.0021 per 1k tokens on 128 H100s, assuming high utilization and a large batch size.
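The quoted per-token cost can be turned into an implied cluster throughput if we assume an hourly A100 price; the price is an assumption, the $/1k-token figure is from the leak:

```python
# Throughput implied by the quoted inference cost on 128 A100s.
cost_per_1k_tokens = 0.0049   # $ per 1k tokens (reported)
a100_price_per_hour = 1.50    # assumed rental price per A100-hour
n_gpus = 128

cluster_cost_per_sec = n_gpus * a100_price_per_hour / 3600
tokens_per_sec = cluster_cost_per_sec / (cost_per_1k_tokens / 1000)
print(f"~{tokens_per_sec:,.0f} tokens/s across the cluster")  # roughly 11,000 tokens/s
```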
🔸 OpenAI, like many others, uses multi-query attention (MQA) to reduce the memory footprint of the KV cache. Even so, the 32k-seqlen version of GPT-4 cannot run on 40GB A100s. To keep inference efficient, OpenAI uses variable batch sizes and continuous batching.
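A rough illustration of why multi-query attention matters at 32k context. The layer count, hidden size, and head count are assumed stand-in dimensions, not leaked GPT-4 values:

```python
# KV-cache size per sequence at 32k context, multi-head vs. multi-query attention.
layers, d_model, n_heads = 120, 12_288, 96   # assumed dimensions for illustration
head_dim = d_model // n_heads
seq_len, bytes_per_value = 32_768, 2         # 32k context, fp16/bf16

def kv_cache_gb(kv_heads):
    # 2x for keys and values, stored per layer, per position, per KV head.
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value / 1e9

print(f"Multi-head (96 KV heads): ~{kv_cache_gb(n_heads):.0f} GB per sequence")  # ~193 GB
print(f"Multi-query (1 KV head):  ~{kv_cache_gb(1):.1f} GB per sequence")        # ~2.0 GB
```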
🔸 The vision capability is a separate vision encoder with cross-attention into the language model, similar to Flamingo, adding parameters on top of GPT-4's ~1.8T. It is fine-tuned on a further ~2 trillion tokens after the text-only training and is used for things like autonomous web-page reading and transcribing images.
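A minimal sketch of the Flamingo-style idea, with text hidden states attending to features from a separate vision encoder via cross-attention; the dimensions and module layout are illustrative assumptions, not GPT-4's actual design:

```python
# Flamingo-style cross-attention sketch: text queries attend to vision-encoder features.
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden, image_features):
        # Queries come from the text stream; keys/values from the vision encoder.
        attended, _ = self.attn(query=text_hidden,
                                key=image_features,
                                value=image_features)
        return self.norm(text_hidden + attended)   # residual connection around the block

block = VisionCrossAttentionBlock()
text = torch.randn(1, 128, 1024)     # (batch, text tokens, d_model)
image = torch.randn(1, 256, 1024)    # (batch, image patches, d_model)
print(block(text, image).shape)      # torch.Size([1, 128, 1024])
```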
🔸 OpenAI is reportedly considering speculative decoding for GPT-4 inference: a smaller model drafts candidate tokens, which are then verified by the larger model. The perceived decline in GPT-4 quality might be due to accepting lower-probability sequences from this pipeline.
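A minimal sketch of the greedy variant of speculative decoding, under these assumptions: `draft_model` returns the next-token prediction for a prefix, and `target_model` scores all drafted positions in one pass. Both are hypothetical stand-ins, not OpenAI's implementation:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = list(prefix)
    proposed = []
    for _ in range(k):
        token = draft_model(draft)
        proposed.append(token)
        draft.append(token)

    # 2) Verify: the large model scores all k drafted positions in a single
    #    forward pass, returning its own greedy choice at each position.
    verified = target_model(list(prefix), proposed)

    # 3) Accept drafted tokens while they match the large model; at the first
    #    mismatch, take the large model's token instead, so the output matches
    #    the large model's greedy decoding.
    accepted = []
    for d, v in zip(proposed, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)
            break
    return prefix + accepted

# Toy stand-ins (not real models): draft and target agree on a simple counting
# sequence, so all k drafted tokens are accepted in one verification pass.
draft = lambda seq: (seq[-1] + 1) % 100
target = lambda prefix, proposed: [(prefix[-1] + i + 1) % 100 for i in range(len(proposed))]
print(speculative_step([1, 2, 3], draft, target, k=4))  # [1, 2, 3, 4, 5, 6, 7]
```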
🔸 Inference runs on clusters of 128 GPUs spread across multiple datacenters, using 8-way tensor parallelism and 16-way pipeline parallelism; each 8-GPU node holds ~130B parameters. By scaling-law estimates, OpenAI should have trained on roughly twice as many tokens for an optimal model of this size, which points to the difficulty of sourcing enough high-quality data.
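Decomposing that reported layout; the gap between the ~112B implied here and the quoted ~130B per node presumably covers embeddings and other shared weights:

```python
# Decomposing the reported 128-GPU inference layout.
tensor_parallel, pipeline_stages = 8, 16
total_gpus = tensor_parallel * pipeline_stages          # 128
total_params = 1.8e12

params_per_stage = total_params / pipeline_stages       # weights held by one 8-GPU node
weights_gb_per_gpu = params_per_stage * 2 / tensor_parallel / 1e9  # fp16, 2 bytes/param
print(f"{total_gpus} GPUs; ~{params_per_stage / 1e9:.0f}B params per node; "
      f"~{weights_gb_per_gpu:.0f} GB of weights per GPU in fp16")
# -> 128 GPUs; ~112B params per node; ~28 GB of weights per GPU
```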
🔸 GPT-4 was trained on 13T tokens, with CommonCrawl & RefinedWeb contributing 5T each. The remaining data is speculated to come from "secret" sources like Twitter, Reddit, and YouTube, with possible contributions from datasets like LibGen, Sci-Hub, GitHub, and potentially a custom dataset of college textbooks.
🔸 GPT-4's detailed knowledge of specialized fields is likely due to this textbook data. Attempts to extract memorized passages suggest specific books it has probably seen; it even appears to recognize the unique IDs of Project Euler exercises.
Source
@ppprompt