There are several things that we really like about Levanter:
Haliax, the named tensor library that Levanter is built on, makes deep learning code easier to read, compose, and debug than code written with positional axes. I found that I no longer need detailed comments to interpret the reshape and broadcast operations in matrix computations.
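To give a flavor of what this looks like, here is a minimal named-axis sketch in the spirit of Haliax. The axis names are made up for illustration, and the exact `hax.dot` signature may differ slightly across Haliax versions:

```python
import jax
import haliax as hax

# Named axes carry both a label and a size.
Batch = hax.Axis("batch", 32)
Embed = hax.Axis("embed", 512)
Hidden = hax.Axis("hidden", 2048)

key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
x = hax.random.normal(key_x, (Batch, Embed))   # NamedArray with axes (batch, embed)
w = hax.random.normal(key_w, (Embed, Hidden))  # NamedArray with axes (embed, hidden)

# Contract over the Embed axis by name: no need to remember which positional
# dimension to reduce, and no transpose/reshape bookkeeping.
y = hax.dot(x, w, axis=Embed)                  # result has axes (batch, hidden)
print(y.axes)
```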
Levanter offers FSDP and tensor parallelism to train language models at scale. We achieved up to 54% Model FLOPs Utilization (MFU) and 77.1% Hardware FLOPs Utilization (HFU) on TPUs, matching the state-of-the-art performance reported by Google’s MaxText, MosaicML, and Megatron.
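Under the hood, this kind of parallelism rests on JAX's device meshes and named sharding. The snippet below is a generic `jax.sharding` sketch of FSDP-style parameter sharding, not Levanter's actual implementation:

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh. FSDP shards parameters along the "data" axis;
# a second mesh axis (e.g. "model") would carry tensor parallelism.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# A toy parameter: shard its leading dimension across the "data" mesh axis
# (the dimension should divide evenly by the number of devices).
w = jax.random.normal(jax.random.PRNGKey(0), (8192, 1024))
w = jax.device_put(w, NamedSharding(mesh, P("data", None)))

print(w.sharding)  # each device now holds a slice of the rows
```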
Training jobs on Levanter are bit-wise reproducible: rerunning with the same configuration yields exactly the same loss curve. Say goodbye to non-deterministic debugging in deep learning.
Levanter also has very neat features like live visualization of text data, distributed data preprocessing with Ray, and tight integration with W&B.
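As a rough illustration of how Ray fans preprocessing work out across a cluster, here is a toy sketch with a hypothetical `tokenize_shard` task; it is not Levanter's actual pipeline:

```python
import ray

ray.init()  # connects to an existing cluster, or starts a local one

@ray.remote
def tokenize_shard(shard_id: int) -> int:
    # Hypothetical worker: in a real pipeline this would read a shard of raw
    # text, tokenize it, and write the result to shared storage.
    return shard_id * 1000  # placeholder for "number of tokens produced"

# Fan the shards out across the cluster and gather the results.
futures = [tokenize_shard.remote(i) for i in range(16)]
print(sum(ray.get(futures)), "tokens processed")
```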
It has been my great pleasure to work closely with David Hall, Percy Liang, and other amazing colleagues at Stanford CRFM on Levanter. I have used it to train multiple large-scale language models on TPUs, and it is both delightful to use and powerful for the job.
Levanter is now open-source on GitHub here under the Apache 2.0 license. I hope it proves useful to the community for training LLMs, and that it continues to evolve in future releases!