RWKV: Reinventing RNNs for the Transformer Era — with Eugene Cheah of UIlicious

Latent Space: The AI Engineer Podcast

The AI Engineer Summit Expo has been announced, presented by AutoGPT (and future guest Toran Bruce-Richards!). Stay tuned for more updates on the Summit livestream and Latent Space University. This post was on HN for 10 hours.

What comes after the Transformer? This is one of the Top 10 Open Challenges in LLM Research (https://huyenchip.com/2023/08/16/llm-research-open-challenges.html) that has been the talk of the AI community this month. Jon Frankle (friend of the show (https://www.latent.space/p/mosaic-mpt-7b)!) has an ongoing bet (https://www.isattentionallyouneed.com/) with Sasha Rush on whether Attention is All You Need, and the most significant challenger to emerge this year has been RWKV, short for Receptance Weighted Key Value models (https://huggingface.co/blog/rwkv), which revive the RNN for GPT-class LLMs, inspired by a 2021 paper on Attention Free Transformers (https://arxiv.org/abs/2105.14103) from Apple (surprise!).

Practically, this means that RWKV models tend to scale in all directions (both in training and inference) much better than Transformer-based open source models, while remaining competitive on standard reasoning benchmarks.

swyx was recently in Singapore for meetings with AI government and industry folks (https://aisingapore.org/), and grabbed 2 hours with RWKV committee member Eugene Cheah for a deep dive, the full recording of which is now up on Latent Space TV (https://youtu.be/dvk6X5zeIfY).

Today we release both the 2hr video and an edited 1hr audio version, to cater to the different audiences and provide “ablation opportunities” on RWKV interest level.

The Eleuther Mafia? The RWKV project is notable not merely because it poses a credible challenge to Transformer dominance. It is also a distributed, international, mostly uncredentialed community reminiscent of early 2020s Eleuther AI.

Audio Version Timestamps (assisted by smol-podcaster). These differ from the timestamps of the 2hr YouTube video.

  • [00:05:35] Eugene's path into AI at UIlicious
  • [00:07:33] Tokenizer penalty and data efficiency of Transformers
  • [00:08:02] Using Salesforce CodeGen
  • [00:10:17] The limitations of Transformers for handling large context sizes
  • [00:13:17] RWKV compute costs compared to Transformers
  • [00:16:06] How Eugene found RWKV early
  • [00:18:52] RWKV's focus on supporting many languages, not just English
  • [00:21:24] Using the RWKV model for fine-tuning for specific languages
  • [00:24:45] What is RWKV?
  • [00:33:46] Overview of the different RWKV models like World, Raven, Novel
  • [00:41:34] Background of BlinkDL, the creator of RWKV
  • [00:49:55] The linear vs quadratic scaling of RWKV vs Transformers
  • [00:53:29] RWKV matching Transformer performance on reasoning tasks
  • [00:54:31] The community's lack of marketing for RWKV
  • [00:57:00] The English-language bias in AI models
  • [01:00:33] Plans to improve RWKV's memory and context handling
  • [01:03:10] Advice for AI engineers wanting to get more technical knowledge
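
The linear vs quadratic scaling discussed at [00:49:55] can be illustrated with a rough back-of-envelope count. The sketch below is not RWKV's implementation or a benchmark: the function names are hypothetical, and it ignores hidden size, constant factors, and hardware effects. It only assumes that causal self-attention lets each token interact with itself and every earlier token, while a recurrent model performs one fixed-size state update per token.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Token-to-token interactions in causal self-attention.

    Each token attends to itself and all earlier tokens,
    so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n_tokens * (n_tokens + 1) // 2


def recurrent_steps(n_tokens: int) -> int:
    """State updates in a recurrent model: one per token (assumed fixed-size state)."""
    return n_tokens


if __name__ == "__main__":
    # Illustrative context lengths only; not taken from the episode.
    for n in (1_000, 10_000, 100_000):
        pairs = causal_attention_pairs(n)
        steps = recurrent_steps(n)
        print(f"{n:>7,} tokens: {pairs:>13,} attention pairs vs {steps:>7,} recurrent steps")
```

Under these assumptions, growing the context 10x grows the attention count roughly 100x, while the recurrent count grows only 10x.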

Show Notes Companies/Organizations:

Misc Notes: RWKV is not without known weaknesses. Transformers do well in reasoning because they are expressive in the forward pass (https://twitter.com/karpathy/status/1593417989830848512?s=20), whereas the RWKV docs already note that it is sensitive to prompt formatting and poor at lookback tasks (https://wiki.rwkv.com/#tldr-vs-existing-transformer-models). We also asked pointed questions about RWKV’s challenges in the full podcast.
