

Livestreams for the AI Engineer World’s Fair (Multimodality ft. the new GPT-4o demo, GPUs and Inference (ft. Cognition/Devin), CodeGen, Open Models tracks) are now live! Subscribe to @aidotEngineer to get notifications of the other workshops and tracks!
It’s easy to get de-sensitized to new models topping leaderboards every other week — however, the top of the LMsys leaderboard has typically been the exclusive domain of very large, very very well funded model labs like OpenAI, Anthropic, Google, and Meta. OpenAI had about 600 people at the time of GPT-4, and Google Gemini had 950 co-authors. This is why Reka Core made waves in May - not only debuting at #7 on the leaderboard, but doing so with all-new GPU infrastructure and 20 employees with <5 people on pre-training and a relatively puny $60m in funding.
Shortly after the release of GPT3, Sam Altman speculated on the qualities of “10,000x researchers”:
* “They spend a lot of time reflecting on some version of the Hamming question—"what are the most important problems in your field, and why aren’t you working on them?” In general, no one reflects on this question enough, but the best people do it the most, and have the best ‘problem taste’, which is some combination of learning to think independently, reason about the future, and identify attack vectors.” — sama
* Taste is something both John Schulman and Yi Tay emphasize greatly
* “They have a laser focus on the next step in front of them combined with long-term vision.” — sama
* “They are extremely persistent and willing to work hard… They have a bias towards action and trying things, and they’re clear-eyed and honest about what is working and what isn’t” — sama
“There's a certain level of sacrifice to be an AI researcher, especially if you're training at LLMs, because you cannot really be detached… your jobs could die on a Saturday at 4am, and there are people who will just leave it dead until Monday morning, or there will be people who will crawl out of bed at 4am to restart the job, or check the TensorBoard” – Yi Tay (at 28 mins)
“I think the productivity hack that I have is, I didn't have a boundary between my life and my work for a long time. So I think I just cared a lot about working most of the time. Actually, during my PhD, Google and everything [else], I'll be just working all the time. It's not like the most healthy thing, like ever, but I think that that was actually like one of the biggest, like, productivity, like and I spent, like, I like to spend a lot of time, like, writing code and I just enjoy running experiments, writing code” — Yi Tay (at 90 mins)
* See @YiTayML example for honest alpha on what is/is not working
and so on.
More recently, Yi’s frequent co-author, Jason Wei, wrote about the existence of Yolo researchers he witnessed at OpenAI:
Given the very aggressive timeline — Yi left Google in April 2023, was GPU constrained until December 2023, and then Reka Flash (21B) was released in Feb 2024, and Reka Core (??B) was released in April 2024 — Reka’s 3-5 person pretraining team had no other choice but to do Yolo runs. Per Yi:
“Scaling models systematically generally requires one to go from small to large in a principled way, i.e., run experiments in multiple phrases (1B->8B->64B->300B etc) and pick the winners and continuously scale them up. In a startup, we had way less compute to perform these massive sweeps to check hparams. In the end, we had to work with many Yolo runs (that fortunately turned out well).
*In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. *In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut feeling and instinct.**”
We were excited to be the first podcast to interview Yi, and recommend reading our extensive show notes to follow the same papers we reference throughout the conversation.
Special thanks to Terence Lee of TechInAsia for the final interview clip, who are launching their own AI newsletter called The Prompt!
Full Video Podcast
Show Notes
* Yi on LinkedIn, Twitter, Personal
* Building frontier AI teams as GPU Poors
* Yi’s Research
* 2020
* Efficient Transformers: A Survey went viral!
* Long Range Arena: A Benchmark for Efficient Transformers in 2020
* 2021: Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
* 2022:
* UL2: Unifying Language Learning Paradigms
* Emergent Abilities of Large Language Models vs the Mirage paper
* Recitation Augmented generation
* DSI++: Updating Transformer Memory with New Documents
* The Efficiency Misnomer: “a model with low FLOPs may not actually be fast, given that FLOPs does not take into account information such as degree of parallelism (e.g., depth, recurrence) or hardware-related details like the cost of a memory access”
* 2023: Flan-{PaLM/UL2/T5}
1.8k tasks for instruction tuning
* Encoder-decoder vs Decoder only
* Latent Space Discord discussion on enc-dec vs dec-only
* Related convo with Yi Tay vs Yann LeCun
* @teortaxes:
* If 2024 papers are to be trusted: You don't need (most) attention you don't need (most) kv cache You don't need (most) FFN layers You don't need a reward model You don't need… all the stuff that still makes frontier models work, ironically
* “there have been no real advance since 2019's T5 models”
* The future of Open source models - relevant to a16z vs Founders Fund debate. Open source cannot compete!
Timestamps
* [00:00:00] Intro
* [00:01:57] Yi Tay Intro
* [00:03:02] Path into LLMs
* [00:09:41] Google Brain: PaLM, UL2, DSI, Emergent Abilities
* [00:11:54] PaLM 2
* [00:15:27] Emergent Abilities
* [00:18:26] Quoc Le
* [00:24:16] Marketing Research: How to Start from Zero with No Reach
* [00:27:34] What's needed to be a successful AI Researcher?
* [00:30:31] Reka Origin
* [00:33:24] Starting Reka Infra
* [00:35:04] Why not to use TPUs outside Google
* [00:36:29] Chaotic vs Stable Infra
* [00:38:04] Risk Sharing of Bad Nodes
* [00:41:05] Checkpointing and Orchestration
* [00:43:39] Reka Flash/Core/Edge
* [00:46:59] Recruiting the team
* [00:47:22] Noam Architecture - Swiglu, GQA, RMSnorm, ROPE
* [00:52:26] Encoder-decoder vs Decoder-only
* [00:55:52] LLM Trends - Llama 3 and Phi 3 Glowup
* [00:57:46] LLM Trends - Benchmarks and Evals
* [01:03:25] LLM Trends - Early vs Late Fusion Multimodality
* [01:07:22] LLM Trends - Scaling Laws
* [01:09:41] LLM Trends - Long Context vs RAG
* [01:12:31] Long Context vs Finetuning
* [01:14:14] If emergence is real, when does Efficiency work?
* [01:17:41] MoEs and Upcycling
* [01:20:47] The Efficiency Misnomer - Efficiency != Speed
* [01:25:05] Open Source vs Closed Models
* [01:28:08] Personal Productivity
* [01:33:19] Singapore vs US Academic Scene
* [01:37:42] Building Silicon Valley outside Silicon Valley
* [01:40:29] TechInAsia Meetup
Transcript
[00:00:00] swyx: Thanks for watching. Bye bye.
[00:00:05] AI Charlie: Welcome back, friends. It's only been a week since the World's Fair, and it was incredible gathering the community to see the latest and greatest in AI engineering. You can catch up now on the four live stream track days on the AI Engineer YouTube, and our team is busy editing the remaining workshops and five other tracks, including the surprisingly popular AI Leadership track.
[00:00:28] Thank you all for your support, and stay tuned for news about the next event. The 2025 AI Engineer Summit. Last week, we did a very special deep dive with Josh and John of InView and Databricks Mosaic on training LLMs and setting up massive GPU clusters. And today, we're pleased to follow that up with a very special conversation with Yi Tai, formerly tech lead of Palm 2 at Google Brain, and now chief scientist of Reka.
[00:00:56] ai. Raker's largest model, Raker Core, was at launch. The fifth best model in the world. And the only GPT 4 class model not trained by a big lab like OpenAI, Google, Anthropic or Meta. In fact, while Google Gemini has 950 co authors, Raker only has 20 employees. With up to five people actually working on pre training.
[00:01:21] Swyx was excited to return to Singapore to delve into Yi Reka and building a new AI model lab outside of Silicon Valley. Stay tuned to the very end for a special bonus clip from Yi's recent appearance at the Tekinesia meetup for his spiciest take on why senior management is overrated and why this is the time to build up senior 10, 000x individual contributors like himself.
[00:01:46] Watch out and take care.
[00:01:48] swyx: Welcome to lay space. This is a long time coming, but I'm so excited to have you here.
[00:01:52] Yi Tay: Yeah, thanks for, thanks for inviting and excited to be here. chat about a lot of stuff.
[00:01:57] Yi Tay Intro
[00:01:57] swyx: Yeah. So you are interesting to research and introduce. You are now chief scientist of Rega, which is a super interesting model lab, but before that you were at Google Brain, you were architecture co-lead on POM two, you were inventor of UL two.
[00:02:10] You're a core contributor on Flan, you're a member of the Bard core team, and you also did some work on generative retrieval. That's a very, very illustrious three year career at Google Brain.
[00:02:19] Yi Tay: Yeah, thanks, thanks, thanks, yeah.
[00:02:20] swyx: And then since then, Reka, you joined in March 2023, announced a 58 million series in June 2023.
[00:02:26] I don't know if you know, the post money valuation, or the pre money valuation is public. So it's, crunch basis is, is, Oh, okay, okay. I
[00:02:33] Yi Tay: did not know that yet. 50
[00:02:34] swyx: something million. So you don't even have to leak it. It's on the internet. Okay. Rekha's stated goals were to work on universal intelligence, including general purpose multimodal and multilingual agents, self improving AI, and model efficiency.
[00:02:45] In February You released Rekha Flash. In April, you released Rekha Core and Edge. And then, most recently, you released VibeEval. Is that a good summary of the last six years? No, it's not. Four years? Four years, yeah. Oh my god. We're talking about AI I was wondering, since when did I,
[00:03:00] Yi Tay: like, step into a time machine or something?
[00:03:02] Path into LLMs
[00:03:02] swyx: Yeah, okay, so can we just talk about your transition into, you know, you did your PhD, and we can talk about your PhD, transition into brain and research and all that. You know, I saw you do some work on recommender systems, I saw you do some work on quaternions. What the f**k was that?
[00:03:17] Yi Tay: Okay, let, let, let's, let's forget about
[00:03:18] swyx: that.
[00:03:18] Just describe your path into modern L lms, right? Because you were, you were, you didn't start there.
[00:03:24] Yi Tay: Yeah. Okay. Sure. Sure. I, I, I think the world also didn't start start there, right? I mean, I think in so I joined Google in 2019, end of 2019. And the world looked like really different at the time, right?
[00:03:34] I think that was around the time the first GBT was released by. GPT 1 or something was released by OpenAI. So, research, like ML research and NLP research looked very different at that time. So I was mostly, I identified as like a language researcher. I don't like to use the word NLP, Jason will kill me if I use the word NLP.
[00:03:51] But like, I was like, okay, a language researcher. I, , but I was more like an architecture kind of researcher. And when I joined Google, I was also I continued on this as a model architecture researcher. I worked a lot on efficient transformers. That was your first viral paper. Yeah, yeah, and like, you know, I worked in the long range arena.
[00:04:09] I spent quite a lot of time looking at what we could do without attention. Like, there was a synthesizer paper back in 2020. I think that was like my early days in Google. There wasn't like a At that point of time transformer research was mainly like WMT, like machine translation and like perplexity and stuff like that.
[00:04:25] It's not really about You know, there wasn't like, I think it was in field short, field short learning and field short in context learning came only about like, you know, when GPT 3 came out and beyond, right? And then, so I think that at that time, the meta, I would say, the meta looked very different. And at that time, a lot of the work was focused on Like fine tuning things like T5 or BERT or something like that, right?
[00:04:45] So I think a lot of the research, not only myself, but like around me or like even the broader community were working on those kind of things. And so I think that was, which I feel that in hindsight today is actually pretty useful to like kind of think about because a lot of people came into like, AI into right after ChatGPT came out, right, so they saw AI as kind of, I think there's a lot of benefits of you know, understanding how, you know, transformers and like, I've broken this thing apart so many times, trying to, it's like, these things actually, you know, help to improve intuition and it's not totally disconnect I think a lot of things are still relevant today and, and it's just the scale has gotten much larger and also the paradigms shift a little bit from Single task, fine tuning to like generally do everything kind of universal foundation models.
[00:05:29] Foundation models, right. I think it's just a slight change in paradigm, but fundamentally, I don't think like the underlying principles of research hasn't really changed that much except for like compute. Yeah. So basically algorithms
[00:05:42] swyx: stay put and then compute and data scale.
[00:05:45] Yi Tay: So I have some thoughts about this.
[00:05:47] So I think back then a lot of the academic research, I think people have talked about this, like Sasha Rush has talked about this, or other people have talked about this, it's like, the conferences were always organized by like, Applications, right? They were always organized by like, Oh, like question answering.
[00:06:02] It was always by this, right? I think there was, there's like a bit of a transpose going on. Things become universal and then becoming like, okay, there's a data work stream, there's a model architecture work stream, and then people work on improving like a universal model and general purpose algorithms to improve this model rather than finding domain specific tricks.
[00:06:20] I think for, even in 2019, I think I've already been Like focusing on works that are like you know, you could improve on general architecture at that time. It was like, like maybe LSTMs in 2017 or something, and then you try on like 10 different tasks and the kind of thing, right? But like a lot of the research community have been focused more on like, how do I get that extra 2 percent on question answering or like, and then sentiment analysis.
[00:06:44] I think. There was this phrase of like, in 2017, 2018, where this style of work was still very fashionable in academia and conferences, right? And then, I think the big thing about the chat GPT moment of like, 2022, the thing that changed drastically is like, it completely like, it was like this sharp, Make all this work like kind of like
[00:07:02] swyx: obsolete.
[00:07:03] So November 2022, you're saying? Exactly, Charged GPT launch? Because I feel like if you're in the research community, this was coming.
[00:07:08] Yi Tay: Yeah, yeah. That's what I'm saying. I'm saying that like the big labs and stuff, like people have already been moving towards general, like even T5 was already like general purpose.
[00:07:15] Yeah. And that's the thing, right? But like, there was, it's like, there's a bit of a time okay, like places like Google and Meta, OpenAI, we will be working on things like three years ahead of everybody else. And academia will be like, Still working on like this path specific things, Got it, got it. And then like, I think the faulty function was the, the ChatGPT moment actually really like, It was coming, it was coming, it was just like the final, the last straw, and then it's finally like, Yeah,
[00:07:39] swyx: now it's serious.
[00:07:40] Yi Tay: Yeah, now it's really, the thing completely changed. I don't know how it turned from my, from my background to like, talking about the meta.
[00:07:47] swyx: I think that you navigate the meta very well, and part of my goal here is to also isolate how you think about the meta for other people to reflect on, because I think obviously you do it very well.
[00:07:57] Oh, thanks. I'm looking at your papers published somewhere around 2021. You had a hard cut to 22 Y two and Palm, and you did Y two Palm Emerge Abilities, DSI, REIT recitation, augmented Generation, all in the same year-ish. Mm-Hmm. So like there was, did you change teams? Did you, did you like have a research focus?
[00:08:17] Like when did you become,
[00:08:19] Yi Tay: oh, you're still saying that like language research became the
[00:08:21] swyx: model guy.
[00:08:21] Yi Tay: My research became emergent. It was like, it's very obvious. No, I don't think I'm like a person that like, I'm not like super, super great at like forcing a trend, like two years ahead. And then especially, especially like, Plan for that, right?
[00:08:34] Yeah. I think I smoothly and as like, kind of like as like I, I never actually really thought about this, this way. I just did like at every step, I just like optimized for like what I found to be most impactful and most promising. And then that gradually, and also it is, it is also a lot of influence by talking to people.
[00:08:52] Right? And then at the time I started working more with. I had some close collaborations with Jason and other people. I mean, Google is, you can work with anybody you want, basically. So you're kind of like, also like, partly it's like the environment shift. And I think the environment shifts like very quickly, but like, I was always like pulling for like the environment.
[00:09:10] I was not like, I think it's always good to like have an open mind and move along with the field rather than, okay, this is my research area. I'm going to get stuck in it two years. I think I just move along to find like things that interest me. And naturally I think like that turned out to be like, The things that were most impactful at that time.
[00:09:27] In retrospect, I kind of did well, but like, I never actually really saw it as intentional. Sure. I didn't do anything really intentional, except that's doing what I find interesting, actually.
[00:09:37] swyx: Cool. Well, we'll just talk about the main work at Google Brain, and then we'll move to Rekha.
[00:09:41] Google Brain: PaLM, UL2, DSI, Emergent Abilities
[00:09:41] swyx: So out of UL2, Palm, Emergent Abilities, which of these came first?
[00:09:46] Yi Tay: Yeah, I, wait, I did, I need, I can't really actually re remember. Okay. What will make you talk about year two then? Year two and DSI, the differential search index? I I was working on it like the December of 2021. So like at Google they were like projects that are like big efforts that are like a researcher will be like part of the effort and then this will be kind of top downish to some extent.
[00:10:04] Right. And then they are, they were also like. Bottom up research that one could do I can't speak for the Google now for sure, but like, at least at that time, right? So UL2 and DSI, Differentiable Search Index, were like, works that I kind of tinkered with in the December break where nobody was around.
[00:10:19] Palm also has this kind of differentiation because there's Palm 1 and there's Palm 2. Right. So Palm 2, I was actually like the co lead of one of the work streams, but like Palm 1, I was more of a contributor and Palm 2, I was like, so, so they were like, now I have to think back of like, okay, what's the timeline, which came first, right?
[00:10:35] In general, they were like three categories of works. One is like broader efforts that are efforts. And then there are some that like UL2 and DSI were like my own projects, like projects I use to compute that. That I had, and then I just played with it. You accidentally left the auto run in for a month.
[00:10:50] Yeah, yeah, yeah, that was in the paper. It was fun, I think. It was really fun. And then, there was also like a third category where those were like, the efforts that my good friends were driving and I contributed. So Flan was just one of them. I know like, maybe on, I would like to just maybe say this publicly, like a lot of people like, I talk a lot about Flan.
[00:11:08] You're Flan's show number one. But like, yeah, but like, the first author is actually Hsiung Wan, who is great, and then like, another guy, Le, I was her cook. core contributor but I mean just because I'm a little more visible so I kind of Accidentally took a little bit more credit for that, but I was a co contributor, but I was not like The lead authors are obvious.
[00:11:25] Yeah, so the third category was projects that my friends Emergence was also like Emergence Abilities No, actually, that paper was supposed to be only me and Jason on the paper. And I actually became friends with Jason From the paper and then that led to like this streak of like, I dunno, 10 papers or something together with Jason and now we are like super good friends, the Ultimate Romance.
[00:11:44] But that was like the immersion paper. But I, emergent paper was also like a belonged to be like a, a bottom up kind of like a thing. And fun times. Yeah, it was fun. ,
[00:11:54] PaLM 2
[00:11:54] swyx: maybe I'll pick on Palm two. Because I feel like, I'll pick on Palm 2 and Emergence, because I really want to make sure I tell those stories.
[00:12:01] Those are important stories. Palm 2, I think it's a career story. It effectively became a co lead on the second version of a very high profile, company wide effort. How did that happen? I think people would like to know, to know how to, you know, what's like the sort of career strategy there.
[00:12:16] Yi Tay: To be clear I was one of the co leads, but there were a lot of co leads, so, so I don't want to take too much credit for that.
[00:12:21] But my involvement with Palm 2 came from the after UL2 was working well, and then it was getting some visibility within Google. Was UL2 the largest model that Google had released
[00:12:32] swyx: at the time? Yeah, I think so. That was the largest. And you just, it was a personal project? It was a personal project.
[00:12:37] Yeah. Yeah. Isn't that unusual? How can it be like one person's decision to like suddenly release something that, you know, effectively changed the trajectory of, I think how, how, how, how people brain, how,
[00:12:47] Yi Tay: how we work was that, I mean, 20 B is not that much larger, but from 11 B to 11 B T five, actually at that time there was starting BT five, right?
[00:12:55] So I think UL two is code decoder 20 B model. I think when we got it approved, it was like. It was released as like, kind of like, like the big brother of T5, you know? Kind of like, okay, we updated T5 with like a new objective and train this new model into DBM we want to, and it uses the same pre training data set and everything, right?
[00:13:13] So like from PRC4. Yeah, from, yeah, that was the easiest because there was precedence, right? It was like, okay.
[00:13:18] swyx: But yeah, there was some architecture, like the mixture of denoisers. Yeah,
[00:13:21] Yi Tay: yeah, yeah. So, so back to Palm two, I think my involvement with Palm Two came from the work to, to, to, to add UL two, to Palm two.
[00:13:28] And then I, I, I mean, it was from the top down point of view. I, I mean, the leads were decided in a top down manner. It's not like, like there was not much like fighting or like, or any major things, right? It was like. It was a mixture of bottom up, top down ish, half half situation, and then from the top it was like, Okay, these are the people who are the most visible in contributing to this workstream, and then, okay, how about E and this other guy will be in charge of this modeling workstream, and something like that, right?
[00:13:58] So I think it just happened that way organically, and yeah, I think that was how I kind of was co leading the modeling workstream.
[00:14:07] swyx: I think in retrospect, you understand now that this is a very valuable experience. And I think now, today, it will be much more competitive to get the job that you got, whereas you didn't, you know, two years ago, you didn't have to try that hard to get it.
[00:14:20] Or like, you kind of lucked into it with you all too, and then like, it just compounded from that initial good decision.
[00:14:25] Yi Tay: I think it's very hard to counterfactually analyze these type of things. I think it's definitely true that there are more people working on generative AI now, and, you know, if you are in a big company, it's way harder to navigate.
[00:14:35] Like these type of things, right? I wouldn't say that there were like nobody or so wanting to work on this at that time. In fact, there were actually But you were the obvious choice. There were less people. There were definitely less people, but I think I would say that maybe it's slightly harder now, but like, it's also not like it was easy at the time.
[00:14:50] Yeah.
[00:14:51] swyx: Yeah. I imagine it's sensitive. But also in my mind this is now the most And this is the most valuable on the job training in the world. And so people want to know how to get it. This is what I'm trying to figure out.
[00:15:03] Yi Tay: Like, actually, individually we also cannot pick somebody else's experience and then try to replicate it on, because everybody's circumstances, their initialization point, their, That thing is kind of also like in different This is not only true for LLMs in general, right?
[00:15:16] Because a lot of times like, oh, okay, you did this in this position, and then because of this It's very hard to trace all this down to to find the causal path for this thing. So I think everything in life, there's some luck involved, I guess. Yeah,
[00:15:26] swyx: there is.
[00:15:27] Emergent Abilities
[00:15:27] swyx: Emergent Abilities. Very influential paper.
[00:15:30] Subsequently contested by the Mirage paper. Oh, yeah, yeah. So before we get to the Mirage, was there a story behind Emergent Abilities? You know, I'm sure it's Jason's Thesis or like what? Just tell, just tell more about like the behind the scenes, like was was, was there a discussion that led to
[00:15:43] Yi Tay: it that, you know, this one was like this, the idea, the inception of it was like mostly Jason.
[00:15:49] Okay. Right. I think I, I helped out to like. You know, shape up a little bit of the paper get some stakeholders involved and stuff. I was discussing quite a bit with Jason, but this, the idea itself was Jason itself. So, actually, when the Mirage thing and everything came out I didn't okay, I was just hot takes for the sake of hot takes.
[00:16:06] I didn't feel, I believe in emergence I have to just go on the record and just say I mean, I believe in emergence. And then I was not feeling very strongly because I think that, I can't speak for Jason, but I would just imagine that he would be Maybe personally offended because because I know Jason is a person that takes a lot of like feedback like very well He's a very like he's not offended by harsh feedback and he rebuts well like online as well, right?
[00:16:29] But like I would imagine he will be the one that is the most like affected by Criticisms of emergence. I was believing in it, but I have to say that the paper, I mean, that's why he's the first author and I'm second, but that was mostly Jason's thesis, and I have to really say that Jason has really good ideas, and I was more of a support role for that paper, yeah.
[00:16:49] swyx: Sure, yeah, you know, lots more to discuss there, but you believe in emergence, that's enough for me to work with.
[00:16:55] Yi Tay: I also think that the, the, the Mirage paper is mostly like I don't know who, actually I don't even remember who wrote it. Ryland
[00:17:01] swyx: Schaefer. Yeah, I, I covered him on, on my NeurIPS podcast.
[00:17:03] Yi Tay: Okay, okay.
[00:17:04] swyx: He's a very good speaker, and the paper was well done. It's just that people drew the wrong conclusions from the paper. Because they had a very good title. Do you believe in emergence?
[00:17:12] Yi Tay: Of course. Okay, high five.
[00:17:14] swyx: I mean, how can you read any paper, read any, the progress of LLMs and not believe in emergence?
[00:17:20] It's so stupid. Like, just because you re parameterize some benchmarks in evals and, you know, make it linear, doesn't mean emergence is completely gone. And even in the Mirage paper, they acknowledged that there were some metrics that were true, genuine emergence, according to them. I think it was something like 25 ish percent in the ballpark.
[00:17:38] That's not the exact number, but it's in the ballpark. So I was like, okay, fine, like some benchmarks you disagree with, but on the whole, there is emergence, it's just, now we're just talking about the magnitude.
[00:17:47] Yi Tay: Yeah, yeah, yeah, for sure, for sure. I don't think the authors of the paper had really very Like they, they didn't, I mean, nobody, we, we should just assume people don't have bad intentions.
[00:17:55] Right. But like, no, they, they definitely were just doing this. But like the, the, I think the Popul media, I was more like annoyed by the nearest best people. I mean, okay. Best people was, let's take the thing, take of a grain of salt, right? Yes. But like, there were people come to me like, oh, you should care about this because it's the nearest best disprove because it's the nearest best paper.
[00:18:11] I'm like, paper awards like mean anything. Actually, it doesn't mean anything. Right? Like. I think that was more of my where my angst was coming from. I don't, I don't think I really had, I don't even remember who were the authors of that paper, right?
[00:18:23] swyx: I'm sure they're doing well for themselves, and we don't have to dwell too much on that.
[00:18:26] Quoc Le as manager
[00:18:26] swyx: Okay, so a couple more things from Google, and then we can go to Rekha. Kwok Le was your manager.
[00:18:30] Yi Tay: Yeah, yeah. What is I had another manager called Don. Like, I had two managers during my time at Google.
[00:18:34] swyx: So I'm just basically going to ask for quick hits from what did you learn from Kwok? What did you learn from Jason?
[00:18:38] What did you learn from, you know, Juan? Who they are, who they represent to you,
[00:18:42] Yi Tay: like, how they advise you and all that. So Kwok as a manager, he was more like a friend, and we would like talk a lot about, I think Kwok is a very researchy person, he has a lot of like, he's more of like an intuition, I learned a lot about like from him about like, there was no like concrete, like it was more like over time, and it was very implicit, soft kind of feeling, but I think like a lot of research science, we were like brainstorming a lot about like, , I quite like that, you know, when we were But there was this U palm paper that, that didn't like get much, like as much attention that I feel it deserves, but like, I think that was one of the works that I, I kind of like discussed with court quite a bit and like and that time you're releasing the fund two stuff and everything.
[00:19:16] And then like, I think court has a lot of good sense about like what makes a work a good hit and like you know, publicly a good hit. And like a lot of research sense about like what what makes like. like research cool, you know, so I think he has good like intuition as a researcher and I learned quite a little bit about and I also I was going to say that I think Jason also probably learned like quite a bit from Quark and and this also influences like more of like it was not only just like me getting influenced but that there was like Jason getting influenced and then Jason influenced me so I think overall what I learned from Quark's like intuition, research taste, people like chat about AGI sometimes, singularity and stuff like this it was Like, he's nice to talk to as a friend, manager, kind of, he's like kind of a friend figure to me.
[00:20:01] He's very much a researcher more than like a corporate, you know, manager kind of thing.
[00:20:06] swyx: I totally expect that. It
[00:20:07] Yi Tay: was fun, it was fun.
[00:20:08] Jason Wei
[00:20:08] swyx: Jason Wei, what did you learn from him? What is your distillation?
[00:20:11] Yi Tay: Okay, Jason is very interesting. So, I learned in my career, I learned two or three things, major things from Jason, right?
[00:20:18] So, I think the first thing I learned from him is that so Jason was actually, okay, I'm going to talk about the more casual, more fun stuff first. Jason was the most spicy on Twitter first before me. There was an era where I was a goody two shoes, I only had my main account, I only tweet my only tweets Newspaper alert, you know?
[00:20:34] Right. And then Jason was starting to post like, like hot takes, right? Yeah. And I just thought to myself, oh damn. Like, you know, and there were, there were types that I was like, Jason, you should not post this. You're gonna get cancer. Right. And he, he, he was fine. He, he always break through the storm and everything until I, I looked at him and I'm like, maybe it's not that bad after all.
[00:20:50] Just be, be, I love it. Right. So there was like kind of like, which is very interesting 'cause Jason is much younger than me and I. And the other thing also, our accounts, right, we created them around the same time, right? And the interesting story behind it was that Jason's account and my account has our own, our original it was not like an anime character that nobody know who is it.
[00:21:09] We have our identity. It's pseudonymous. It's pseudonymous, right? And then I asked Jason why do you want to have a So like, why don't you just make like, and he told me this thing which was quite true was that like, Okay, you can post a thing that is spicy and it's hot, but if you cannot stand by the opinion, then you should not have the opinion in the first place, right?
[00:21:25] Wow. Right, so there was something that, oh, okay, I thought that was profound because so far this, I mean, there are times where, okay, I post something and it's spicy and then, okay, it gets a little bit bad, and then I, okay, I kind of agree that, okay, this is bad, then I will retract it. But if I could stand by the opinion, then I would just stand by it because that's the point of making it It should be said.
[00:21:42] Right, it should be said because. I can put my name behind it, right? So that was This is part of the first bucket about how it kind of influenced my online persona like, a little bit, and then, I mean, and then it turns out that now AGI Hippo is so much more spicy than the cola The cola is just hibernating somewhere, it's not even around, right?
[00:22:00] So, I mean, Jason is also more constrained because he works for he has Like an actual employer, right? And he has to be a little bit
[00:22:08] swyx: more The worst thing about Twitter is that, you know, anytime anyone from OpenAI tweets anything, they're like Did you see this researcher from OpenAI said something?
[00:22:15] And they read tea leaves that are not there, and it makes you very cautious to tweet anything. And so it kills the golden goose, is what I say.
[00:22:22] Yi Tay: There was one tweet, I mean, at the time when somebody was, people were speculating the GPT 2 chatbots, right? And then Jason just posted something on his main account something excited about new experiments being run just a random just, and then people screenshot that, and post like, Yeah, I hate that.
[00:22:35] So I think, I, now I think for, All the count is mostly like personal, like personal stuff , very personal I think he would stay away from non work things non work things.
[00:22:44] swyx: The golden goose has been killed because people on Twitter cannot control themselves from drawing random conclusions from, you know, all these hints and all that Yeah, yeah, yeah, yeah,
[00:22:52] Yi Tay: but Going to like the actual, this is like filler, filler, this is not canon, it's filler.
[00:22:57] I think the second thing I learned from Jason is more about like, from my you know, kind of like, from my own career, it's like, the importance of like marketing and PR. So Jason is actually like, I mean, I was just like, he was actually like really, you know, the emergence, like how many blog posts you wrote about the emergent abilities and how many talks he's given about, about emergent, like a lot, probably like the other day I was just at this webcom keynote and he was giving a keynote again about emergent abilities and it's been two years, right?
[00:23:25] So I, I think one big success of him is that like he, he does the work. Okay. Thanks a lot about like marketing the work itself. I did not like in my early parts of my career, early parts in Google, right? I was putting out a lot of work, but I didn't put in a lot of like effort in like thinking about the, like how the work is going to be received.
[00:23:42] I'll just be like, here's a paper, here's a paper, here's a paper, right? But Jason will be like, I'm going to write this paper and I'm going to like market the s**t out of it. So I, I, I think I learned a lot about like, so every single first author paper that Jason writes in the last, he has like 1000 citations in one year.
[00:23:56] Oh my god. Like no, I mean, not every, but like most of it that he leads. So his hit rate is very high. His hit rate, like impact density, like it's very high, right? So, It's pretty interesting but I kind of see him as like a peer and I learn a lot from his basically some, some people are just like talented in, in different ways.
[00:24:11] And, and I think that like, I, I looked at how he markets his own work and markets himself actually, right?
[00:24:16] Marketing Research: How to Start from Zero with No Reach
[00:24:16] Yi Tay: If someone is starting from zero,
[00:24:17] swyx: like no Twitter presence, what is the second best thing to do? You mean as a researcher? For marketing, yeah.
[00:24:23] Yi Tay: I, I think you would like the, the, the most obvious. If you're like a re like say hypothetically, you're like a researcher in like a place without visibility or without an end.
[00:24:32] You have no personal visibility. The first goal is always to try to find a mentor or coworker that is like within this circle, and then you start from there, right. Because, and then you get, you get, you know. people from like, who has a visibility and following to retweet. So you will like, work with them the big goal is not about like, I learned this, I mean, this is like, probably a career mistake in my early days, was that you know, instead of like, focusing on like, so called people okay, if you do good work, it's more of like, okay, how am I going to I see this visible researcher from DeepMind, right, or how can I collaborate with this person, and then kind of do something that feels cool, and like, I can win their respect, and that they would like.
[00:25:09] You know, they will be willing to co author for me because the exercise itself was so about how to, you're not trying to please reviewers or anything, you're just, if you can find one semi visible, you don't even have to be like a famous person, that's like a semi few thousands of followers, has a good reputation of research, and then you collaborate with this person, and then when you post the work, you are co authored with this person, and then you get the person to vouch for you, or just, over time, this would It could be from internships, it could be from, you know, just DMs.
[00:25:38] I think, you know, people are nicer than some people, they seem scary, but if you DM them, they're actually willing to collaborate, actually.
[00:25:44] swyx: I was scared of you, actually. And when I DMed you, you turned out a lot nicer than I feared. So thank you for being nice. That's really great advice for people.
[00:25:55] I just want to leave that out there for people. For others who follow, you know the work that the career advice that I give, the title topic of this is pick up what others put down and specifically pick up what your mentors put down. Like, mentors always have more work to do than they have personally time for.
[00:26:09] The high visibility mentors, and if you can show that you're a good collaborator with them, they will lift you up. Accordingly, that's a pretty good formula for career growth. Should I ask about Hyungwon? Or I don't know how close you are. Oh, we're
[00:26:21] Yi Tay: still good friends. Hyungwon is a great engineer and he's very systematic in the way he thinks.
[00:26:26] I think Hyungwon is without going into detail, I still spend a lot of time talking to Hyungwon, even like in the, even after we both are different places, about like very interesting algorithm, arithmetic ways to think about life. Very interesting like, perspectives on life rather than research.
[00:26:43] But Hyungwon is a great engineer. And the one thing that scares me about Hyungwon is he doesn't have multiple monitors. He just codes with one small screen. And he does everything with very hyper optimized. And then I
[00:26:54] swyx: want those U curve where one screen, one screen, and then many screens.
[00:26:57] Yeah, yeah, yeah.
[00:26:58] Yi Tay: So I think Hyungwon scares me because it's like, I think that was at NeurIPS 2022. Like, we were doing some work at the New Orleans. And then He'll be coding perfectly fine with this 13 inch MacBook with one terminal, and then he'll be like, he keeps telling us okay, it's more optimal to using keyboard is more optimal than moving your head, because if you can switch your screen fast enough, it's faster than your head, like moving to different screens and stuff.
[00:27:24] I did not actually distill that, because it's too painful to do that, but it's very interesting in a way that I'm he belongs to one of those hardcore people with one monitor and
[00:27:34] What's needed to be a successful AI Researcher?
[00:27:34] swyx: Maybe this is a relevant question to just close out the Google site. What do you think is a good programmer for AI research?
[00:27:42] Yi Tay: You mean set up or eating? No, no, not set up.
[00:27:46] swyx: Not even lifestyle. It's more about skills. Like what should people have? What do you interview for, maybe? What do you see that the high performers do differently than less high performers?
[00:27:54] Yi Tay: I mean, okay, like generally, there's like, I think like for AI researchers, like being a strong IC is like probably like the thing that I feel like is like important for AI researchers.
[00:28:03] Like not, not I think There's a certain level of sacrifice to be an AI engineer, AI researcher, especially if you're training at LNs, because you cannot really be detached from your jobs could die on a Saturday at 4am, right, and then there are people who will just leave it dead until Monday morning, and then, or but there will be people who will crawl out of bed at 4am to restart the job, or to Check the, you know, TensorBoard or something like that, right?
[00:28:31] I think a lot of being a successful AI researcher, I don't want to say passion is also the entire thing, but it's more of just the a kind of personality that, if something, there's a bug at 3am on Saturday night or something, right? And then you would like, be like, you couldn't go back to sleep unless you, you, I'm not, this is very unhealthy by the way.
[00:28:50] People should not do this for a long time. You know, I think this kind of things actually like, allows people to make progress faster. But it's unhealthy, so I'm also not even sure like, what's like the, checking out on like, Friday, Saturday, Sunday, and like, 9 to 5 if you want to like, make progress, or like, some people are just so good at detaching like, okay like 8pm, I'm not going to My job can die and then the chips can stay idle for like the whole night, but I want to watch Netflix, right?
[00:29:15] You cannot, I think there's a level, it's like a sport you cannot win an Olympic gold if you want to have super, ultra good work life balance, right?
[00:29:23] swyx: Yeah, passion, intensity, dedication. Yeah, intensity, right? So those are really good personal qualities. Just technical qualities wise, how much of the stack should people know?
[00:29:32] You know, if I Okay, so
[00:29:33] Yi Tay: that was the question.
[00:29:34] swyx: No, no, no, but that was important as well. Okay. It's just harder to interview for because you really just see it on the job.
[00:29:40] Yi Tay: I think stack is not not, not, stack is not that, like Like, should I know CUDA kernels? I don't know CUDA kernels.
[00:29:45] swyx: Exactly, right? Okay, good.
[00:29:47] So for all you listening out there, you don't have to feel like an imposter. No, but, but you need to be willing to learn if you have to, I think. Well, you haven't had to so far. Yeah, I haven't had to so far, right. So if I sling pie torch, okay, great. You know, what kind of do, do I do I know like distributed systems, like do I know, like what, what is the, what is the stack that you recommend for people that get, gets you like a well-rounded end-to-end researcher.
[00:30:08] Yi Tay: I don't, I, I don't think there's any specific thing. In fact, I will try to be as I don't really say like, okay, you need to learn Jax, you need to learn this. By the time you think there's a new frame out anyway, so, so it's more of like. Staying like constantly, like trying to, being able to continuously learn and update.
[00:30:24] I, I don't think that's a single, single stack or like a single single like workflow or I don't think that's a single one. Yeah.
[00:30:31] Reka Origin
[00:30:31] Yi Tay: Well, that, that leads us to Rebecca. Yeah. What's the founding story? So, I, I met some of my other co-founders while we were collaborating at that did my end. I was at, at brand and they were like a DeepMind.
[00:30:41] I'm not like a, a, a startup person. I, I I, I identify even today. As a scientist and a researcher more than like a startup person, right? My co founder, Danny, started this story. Right. And then this, this record was like, in the works from like, late 2022. I, I finally left in 2023. Then he kept asking me, he wants to do something.
[00:31:01] Do I want to go with him and do it? And, and it took, took a while for for me. Also, I was like, kind of the last co founder to kind of form the Was
[00:31:07] swyx: the plan always for you to leave at some point and join him? No, no. He was convincing you to do it. It was
[00:31:12] Yi Tay: like, it was like a six months, more or less, in fact I think more than six months period of like, I always had this at the back of my mind for since like, what, August actually, I didn't want to do it in the first place.
[00:31:25] But I think eventually in March, I felt that okay, it's time for me to experience something new. Like, my leap of faith was more of like, I want to experience something new. I've, okay, I've like, Wrapped up this palm to work at Google and then like more of like, okay, let me experience this new life and see Where we can go with this and I also I mean, we don't have a lot of like, okay The funny thing was that like many many years ago before I PhD I wanted to do a startup actually at that point and then over time I realized that like I was better off as a researcher And I just forgot about the startup thing and it's quite funny that today I end up doing a bigger startup, right?
[00:31:58] but even until now I I actually I did identify more as like a researcher and scientist. Well, I mean, it's not, when you
[00:32:05] swyx: left you already had a high profile coming out of Brain. You could have gone to any startup out there. They all had wanted you. Yeah, okay, okay, yeah. So why did you choose this one, basically?
[00:32:13] Like, is it just because of pre existing relationships? Because, you know, It wasn't obvious to me. A lot of it, the other coworkers went to OpenAI, others went to, the, if you're, if you're fair, you went to Misra, you know, that kind of stuff. Right? Like Rico, Rico was like not on the, on
[00:32:25] Yi Tay: the map.
[00:32:26] Yeah. I, I, I think it was, for me, it was the ion between staying at, at, at Google and like co-founding something. I, I, I didn't want to like, like it was more of the experience of like being a co-founder. And this is like what attracted me, right, and wanted to experience that. I wouldn't have left for Inflection, or something like that.
[00:32:42] Like, I mean, Inflation is gone, but
[00:32:43] swyx: like RAP? They're still alive. They're selling themselves as a model foundry or something. I don't know, there's a services company now.
[00:32:52] Yi Tay: Yeah, I know, but I also think that like Like, for example if you were to join another, it would be like a very big tech experience again, right?
[00:32:58] I don't know, I felt like, the experience I get is very complementary to what I have that's the experience I had at Google, right? But if I were to join something else, right, then I wouldn't have, I would have just stayed at Google, to be honest. Because to me, it was very clear just two decisions that, that I didn't really I was talking to a bunch of other startups, but I didn't really actually had the intention to I was happy at Google, actually, to be honest.
[00:33:19] I'm sure,
[00:33:19] swyx: I'm sure they have a lot of things to keep you happy. I was happy at Google, yeah, actually.
[00:33:24] Starting Reka Infra
[00:33:24] swyx: So, you describe yourself as GPU poor, but also you had 60 million dollars to play with. You got a whole bunch of GPUs. I think you disclosed somewhere, but I don't remember the exact number. And you had a good training run for Flash and Core and Edge.
[00:33:39] How would you tell that sort of story? Like, people can read the technical report. But also you know, what was that overall experience like? And you should also point
[00:33:47] Yi Tay: people to the blog post that you wrote. There were a lot of interesting things that happened along the way that So I think I left around like early April, the end of March, April and everything, right?
[00:33:58] But most of our compute actually came in December, actually. And there were delays. So H100, there were major delays, right? So we were sitting around, right? And to be clear,
[00:34:07] swyx: you don't own the compute, you are renting.
[00:34:09] Yi Tay: Yeah, yeah, yeah. So we were sitting around. Like, we've, you know, for a long period of time, we had 500 A100s, because we made a commitment and they were constantly being delayed, I think because of H100 supply, demand, whatever reasons.
[00:34:23] And it was also very hard to get a lot of compute in one place, right? And then we were locked in, and we had to wait for the compute to come, right? So, I think, It was very painful because even when the compute came, it was mostly broken most of the time. And it was broken to a very bad extent that, you know, before I left Google I was like, even in the early stage I was very optimistic about You Okay, this compute translates to this amount of flops, this is the model, right?
[00:34:48] But I never expected the reliability to be so poor that it just threw off all the calculations and then we had to, work ten times harder just to make the thing go smoothly. So, it was a bearable pain. I think the pain was bearable, but it was just way, way more than expected.
[00:35:04] Why not to use TPUs outside Google
[00:35:04] swyx: I think you addressed this in your post, but the temptation would have been just to run everything on TPUs. Which is the stack that you already know very well. That that works very
[00:35:10] Yi Tay: well. Oh, no, no. So, so, so TPUs outside Google and t inside Google are probably very different things, I think. Oh, how come?
[00:35:16] Okay. First thing is like infrastructure. Like, there was, there wasn't like a lot of good code bases like outside Google that was like still, right. And, and the code base that I was most familiar with was like T five X. It was a jack space. It would have been like, by, by the time we wanted to consider it, it was really like.
[00:35:31] Debrigaded for nine months, right? And then, TPUs I mean, we weren't sure about I mean, the availability of TPUs was not great, great.
[00:35:41] swyx: Oh, my perception is that it was a lot better. People have the learning curve.
[00:35:44] Yi Tay: Yeah, but at the point of time, we had our infra set up, we were training already, training models, and it would be so much cost to, TPUs.
[00:35:50] So I think TPUs, the experience of TPUs inside and outside Google, I have not actually run a single TPU job outside Google, by the way, but just looking through documentation from what I see outside, it's great. And from like, how much I think that people inside Google don't care about what people think outside Google, I kind of feel like, okay, we were a bit like, I don't think we considered, I mean, not like forever not considering this, but like, just like, At that point of time, it was like, The obvious choice is to stick to PyTorch.
[00:36:15] Just stick to GPUs and PyTorch and make like, I mean, it's not as if the chips we ordered were not there, they were there, they're just not. In the best shape. Reliable. Right? Yeah. So I think it was too much work to, to kind of migrate suddenly to TPUs. Yeah.
[00:36:29] Chaotic vs Stable Infra
[00:36:29] swyx: For those who haven't read the report, you had a very traumatic description about the chaotic and stable phases of various compute providers, and I was just wincing when I was reading all those things.
[00:36:40] Yi Tay: Yeah, no, that was like a 3 body problem reference, the chaotic and stable phases. I mean, I was watching 3 body problems at the time, and I thought it was fun to, there was a lot of like, I think we had a lot of fun adding a lot of references and memes into the tech report. I think like, you know, it goes to show like how fun the environment is within, within record, right.
[00:36:57] We had a lot of fun with this, but so I think chaotic and stable face mostly. It's like we, we actually found that, like usually when like provider provisions, new nodes or they would like Yeah. You don't wanna be the first to use it. Yeah. It is usually like, like bad like dog s**t. Like at the, like at the start.
[00:37:13] Right. And then. It gets better as you go through the process of returning nodes and, and, , draining them, giving it back to them, they will send it back for repairs, and everything and then over time, because it's more of it's more of a numbers game, right? If there's one bad node, It kills the entire job, right?
[00:37:30] So like, the fact of, the game became like, just eliminating bad nodes from the thing, right? And then, you know, I mean, just because of, maybe because of the supply issue or something, when the deadline comes to ship this, for example I just give rough numbers, let's say you order 1, 000 H100s, right? They will not be able to, usually they don't meet the demand of like 1, 000 H100s at the date.
... [Content truncated due to size limits]