Nicholas

Factory’s Matan Grinberg and Eno Reyes Unleash the Droids on Software Development

Archimedes said that with a large enough lever, you can move the world. For decades, software engineering has been that lever. And now, AI is compounding that lever. How will we use AI to apply 100 or 1000x leverage to the greatest lever to move the world? Matan Grinberg and Eno Reyes, co-founders of Factory, have chosen to do things differently than many of their peers in this white-hot space. They sell a fleet of “Droids,” purpose-built dev agents which accomplish different tasks in the software development lifecycle (like code review, testing, pull requests or writing code). Rather than training their own foundation model, their approach is to build something useful for engineering orgs today on top of the rapidly improving models, aligning with the developer and evolving with them. Matan and Eno are optimistic about the effects of autonomy in software development and on building a company in the application layer. Their advice to founders: “The only way you can win is by executing faster and being more obsessed.”

Hosted by: Sonya Huang and Pat Grady, Sequoia Capital

Mentioned:
Juan Maldacena, Institute for Advanced Study, string theorist that Matan cold called as an undergrad
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, small-model open-source software engineering agent
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, an evaluation framework for GitHub issues
Monte Carlo tree search, a 2006 algorithm for solving decision making in games (and used in AlphaGo)

Published Jun 25, 2024
Uploaded Jun 11, 2026

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:32

[00:00] I would have thought this would be different 13 months later, but this is still very much the case where [00:05] "agent" is synonymous with unreliable, [00:08] stochastic, [00:10] demoware, vaporware. [00:12] And I think something very important for us is [00:15] we want to build these systems that [00:17] aren't just, like, cool examples of what is to come, but rather valuable today. And not just valuable for, like, a hacker on a side project, but valuable to enterprise engineers today. [00:44] Hi, and welcome to Training Data. [00:45] We have with us today Matan Grinberg and Eno Reyes, founders of Factory. [00:50] Factory is building autonomous software engineering agents, or droids, [00:54] that can automate everything from the drudgery of maintaining your documentation to actually writing code for you. [00:59] In doing so, they are building the ultimate compound lever. [01:03] Last week, Factory also announced some impressive results on the key AI coding benchmark, SWE-bench, [01:08] beating state-of-the-art by a wide margin. [01:10] Stay tuned at the end of the episode for more context on how they built it. [01:15] We're here with Matan Grinberg and Eno Reyes, [01:18] founders of Factory. Gentlemen, thank you for joining us. [01:22] Thank you so much for having us. [01:24] Yeah, thanks for having us. [01:26] Let's start with a little bit of personal background. And Matan, maybe we'll start with you and then go to Eno. So Matan,

1:32-3:04

[01:32] One thing that I believe you and I share in common [01:36] is an affinity for a well-executed cold call. [01:39] I know at least two cold calls that have had some bearing on your life. Why don't we start with the one that you did [01:44] as an undergrad at Princeton [01:47] to somebody who my partner Shaun Maguire tells me is quite a famous physicist. Can we start with that cold call? [01:53] Yeah, absolutely. So while I was at Princeton, I was studying string theory, [01:59] um, [02:00] and [02:01] the most famous string theorist happened to be working at the Institute for Advanced Study, which is [02:07] an academic institution right next to Princeton University, but not technically affiliated with it. [02:12] Part of the allure of going to the IAS is that you don't have to take on [02:17] graduate students, much less undergrads. That said, [02:21] you know, there's a professor there, Juan Maldacena, who is by far the kind of leader of [02:27] the string theory movement, and, uh, [02:30] you know, being a young, ambitious undergrad, [02:34] I decided, you know, might as well see if I could snag him as an advisor. [02:39] And so with some advice from some graduate students, I [02:43] sent him an email and [02:45] asked if we could meet. [02:46] And the thing about Juan is the way he works with people. You know, he'll take a meeting with anyone, basically, [02:51] and will spend about two hours at the chalkboard with you. [02:55] And in this two-hour chalkboard session, [02:58] he'll subtly drop a [03:00] problem that you basically have 24 hours to solve,

3:04-4:37

[03:04] get back to him with the solution, [03:06] and then you'll officially, you know, be a student of his. [03:09] Luckily, I was warned about this rite of passage, so I was paying close attention to any hints he was dropping. [03:19] Indeed, yes, yes. [03:21] So I found the problem, ended up spending, you know, [03:25] basically the entire night working on it. And, you know, luckily ended up having him as an advisor. [03:30] We were able to then publish a paper together, which was [03:34] very exciting. [03:36] Excellent. [03:37] Yeah. [03:37] Typical undergrad experience. [03:40] Yes, yes, exactly. [03:42] So there's a second cold call that I want to ask you about. Before we get to that, why don't we go to Eno? So Eno, you similarly went to Princeton. You have a CS degree from there. You spent some time as a machine learning engineer at Hugging Face, which is where we first intersected, [03:55] spent some time at Microsoft. But like a lot of great founders, [03:59] your story before then started with some humble beginnings. Could you say a word about the stuff that doesn't appear on LinkedIn that has helped to shape who you are today? Yeah, absolutely. And I think, [04:09] you know, [04:10] my family on my dad's side came from Mexico in the late 60s to San Francisco. [04:16] And my grandparents [04:18] were both working for a bit, but when my dad was born, [04:21] they started a Mexican restaurant in Los Altos. [04:26] And that was in the 70s. They moved it to Haight and Cole in the 80s. [04:30] They were a very kind of, like, San Francisco immigrant story. They actually ended up leaving for Georgia, where I grew up.

4:37-6:07

[04:37] But [04:39] really, I think it's the drive that they had to give my dad a successful life in America, [04:45] and it was my dad and my mom that drove that same kind of mentality into me growing up. And I think [04:52] it's really cool because this story is one that I think a lot of Americans share, [04:56] and something that makes it really exciting to be back in San Francisco, kind of building something to potentially make the world a better place for everyone. [05:05] Very cool. [05:07] That is the dream. [05:08] Um, [05:09] Matan, I want to get back to that other cold call, [05:12] because I think it leads directly into the forming of Factory. [05:16] So our partner, Shaun Maguire, who I mentioned earlier, [05:20] who I believe shares a similar academic background to your own, [05:24] received an email from you [05:26] a year or so ago. [05:28] That led to a walk. [05:29] And very shortly thereafter, Factory was formed. So I'm curious: [05:34] what caused you to cold call Shaun Maguire? And this is less of a Shaun Maguire question, because we know plenty about Shaun Maguire. [05:41] This is more of, like, you're on a very good path. You're doing really good research. You're on track to get a PhD in physics. [05:49] And something inspired you to go in a different direction. And I'm curious, what was it that inspired you that led to that cold call? And maybe tell us a quick story about what happened shortly thereafter. Yeah. [05:59] Yeah, absolutely. So, [06:01] like you said, I was doing my PhD at Berkeley. About a year in, though, I realized that

6:08-7:40

[06:08] I was only doing theoretical physics and string theory because it was hard and not because I actually loved it, [06:14] which is obviously a bad reason to do anything. And, uh, [06:18] I had such tunnel vision on this path that, [06:22] you know, when I came to this realization, it was kind of earth-shattering. [06:25] And I [06:26] looked at the paths ahead of me, and there were basically [06:28] three options that [06:30] seemed realistic. And so it was either going into quantitative finance, [06:34] going into big tech, or going into startups. [06:37] And by this time, I had already kind of switched my research at Berkeley from being purely physics to ML in physics, and then slowly more ML, and then mostly AI. So it was kind of quickly cascading there. [06:50] At the time, I think I saw a video of Shaun speaking, [06:55] I think over Zoom, to some founders at Stanford or something. [06:58] And I recognized his name from string theory research because I had read his papers way back in the day. [07:04] And it was particularly shocking to me because, [07:06] I'm not sure how much time you've spent with string theorists, but normally they're quite introverted. Not, you know, [07:13] not the most socialized. Yeah. And so Shaun is, you know, this very different example. And so to me, I kind of, like, looked at his background, and it was just shocking to see someone [07:25] who was, like, so deep and, like, a bona fide string theorist [07:28] then go in and, like, you know, start his own companies, invest in some of the best companies, join Sequoia, [07:33] and be a partner there. [07:35] And, um, [07:37] to me, it was just like, oh my god, this seems like someone who

7:40-9:14

[07:40] is of my [07:42] kind of background, of my [07:45] nurturing, I guess. And so I [07:47] sent him an email, and I was just like, hey, you know, we both were string theorists. I don't want to do string theory anymore. I'm thinking about AI. Would love to get your advice. [07:55] Like you mentioned, that then turned into a walk. [07:57] It actually was supposed to be a 30-minute walk. [08:00] We ended up going from the Sequoia offices in Menlo Park [08:03] all the way to Stanford and then back. [08:05] And so it ended up being three hours. [08:07] He missed a lot of his meetings that day, so [08:11] it was pretty amusing. And basically at the conclusion of the walk, he... [08:15] So one thing was for sure: he was saying, you must drop out of your PhD. There's way too many exciting things to do. [08:21] And he kind of left me with the advice of, [08:24] you should either join Twitter right now, because this is just after Elon took over, [08:29] and he was saying, you know, [08:31] only the most badass people are going to join Twitter right now; [08:34] two, [08:34] you should join a company of mine as just, like, a general, you know, glue, is what he said. This was Foundry, by the way. Yep. Or three, if there's some ideas that you've been thinking about, you should start a company. [08:46] And I was, you know, [08:47] very grateful, um, [08:49] for all the time that he spent, and we kind of left off there. [08:53] Beautifully in parallel, Eno and I had just [08:56] reconnected at a LangChain hackathon. [08:59] Um, and he was in Atlanta the weekend prior, and he basically got back the next day. So that next day, Eno and I got coffee. [09:08] And [09:09] I think we got coffee at noon, and then basically every hour since then until now,

9:14-10:46

[09:14] Eno and I have been working together, talking constantly about code generation and what became Factory. [09:22] Did you guys know each other in undergrad? [09:26] We had, like, the maximal overlap of mutual friends without ever having had a one-on-one conversation. [09:33] Yeah, it's pretty funny. We were in eating clubs at the time opposite from each other, and we had just so many mutual friends. And it really wasn't until I moved to the Bay Area that we had a face-to-face convo, and, [09:48] uh, [09:49] it was a very fruitful conversation for sure. [09:51] It was intellectual love at first sight, you could say. Absolutely. [09:56] I love that, and it's so serendipitous with the LangChain connection. How did you guys decide on... [10:02] I'm curious, I mean, you're both brilliant. And I think for a lot of founders starting out [10:06] in AI right now, a lot of them find it hard to resist the siren's call of [10:10] "I'm training a foundation model." So, like, how do you decide to, you know, build in the application layer, [10:16] I'm curious, and then why software engineering? [10:20] Yeah, so I think from my perspective, like, going deep from academia, I think [10:25] throughout all the years of spending time on math and physics, [10:28] the thing of beauty that I learned to kind of be drawn to was [10:33] things being fundamental. [10:35] And, you know, spending, you know, time doing AI research, it was so clear that [10:40] code is fundamental to machine intelligence. And so I was just naturally attracted to

10:46-12:17

[10:46] the role that it plays there. [10:48] And I think that kind of joined quite well with Eno's, you know, attraction to the space. [10:54] You've referred to it a couple of times as a compound lever. [10:58] Can you unpack that for us and let us know what that means? [11:02] Yeah, I mean, so there's the famous Archimedes quote about, [11:06] you know, software... or, well, his quote is rather, you know, if you have a large enough lever, [11:10] you can move the world. And then I think that's been co-opted for software engineering, right? That software is a lever upon the world. [11:17] And for us, we see AI, and in particular AI code generation, [11:21] as a lever on software, [11:24] the impacts of that being, [11:26] you know, compounding, exponential. [11:28] I'm sorry, Eno, I think I cut you off. I think you were [11:31] maybe [11:32] mentioning how you got to the [11:35] founding inspiration for Factory. [11:37] Oh yeah, absolutely. I mean, I think Matan's story is really indicative of kind of the energy at the time. [11:44] At Hugging Face, working on training, optimizing, deploying LLMs for enterprise use cases, I was [11:54] early, like, LangChain kind of integrations. [11:56] And it was so clear that the work that was happening in open source [12:00] was directionally moving towards, [12:03] uh, [12:03] modeling human cognition with LLMs, where the LLM was just one piece of the system: [12:09] the idea of chains, and I think Harrison calls them cognitive architectures, or the LangChain folks call it that. [12:16] Um,

12:17-13:46

[12:17] And seeing that happening, and seeing that [12:20] within the code-gen space [12:22] the most advanced players were basically looking at autocomplete, [12:25] it felt like there was a huge opportunity to take that [12:29] to the next step and take some of those lessons that were happening both [12:33] in kind of the fringe research and open source communities and apply them towards kind of [12:39] massive organizations. I realize we haven't said explicitly yet, [12:44] what is Factory? So Matan, [12:46] what is Factory? [12:48] And then maybe what are a couple of the key decisions that you've made about the way Factory is built? And, you know, for example, one of them [12:56] is to start by benefiting from all the ongoing improvements in the foundation model layer. [13:00] One of them might be the product itself. But can you just say what is Factory, and what are some of the kind of key decisions you've made that have shaped Factory today? [13:09] Yeah, absolutely. So Factory is a cutting-edge AI startup. [13:14] Our mission is to bring autonomy to software engineering. [13:19] What that means more concretely: [13:21] we are automating tasks in the software development lifecycle, [13:25] and in particular, tasks like code review, documentation, testing, [13:30] debugging, [13:31] refactoring. [13:33] And, you know, as I list these off, you'll kind of hear quickly that these are the tasks that engineers don't particularly enjoy doing. [13:40] Um, [13:40] and that's very much intentional, right? Like, [13:43] obviously, we are doing code generation, and that's really important.

13:47-15:20

[13:47] But I think [13:48] an equally important thing to, you know, [13:51] generating some [13:52] inspirational and, like, forward-looking... [13:55] It's also important to understand what engineers are actually spending their time on. [13:59] And in most organizations, it's not actually fun development work. [14:04] In most organizations, they're spending a lot of their time on things like review and testing and documentation. [14:10] Normally they'll do these things way too late, and then they're suffering because they're missing deadlines, right? [14:14] And so [14:15] our approach is: [14:17] we want these tools to be useful in the enterprise. [14:21] And so to do that, we need to kind of meet engineers where they are, with the tasks that they are very eager to automate away. [14:28] We call these autonomous systems droids. [14:32] And like Eno was alluding to earlier, [14:34] these are kind of... [14:36] There's a droid for each category of task. [14:39] And in this kind of a paradigm where [14:42] we want to frame these problems as games, [14:45] it's very convenient that software development has a clearly defined software development lifecycle. [14:51] And so for each kind of category of task, or each step in the software development lifecycle, [14:57] we have a corresponding droid. [14:59] Um, [15:00] so that's kind of a first pass there. I guess there was, I think there was a second part of your question that, uh, [15:06] I missed. Oh? [15:07] We'll get into the rest of it. Where did the name droid come from? It's a pretty catchy name. It's very memorable and distinct to Factory. Where did that come from? [15:16] Yeah, yeah. So, I mean, keep in mind, you know, when Factory started, this was,

15:20-16:52

[15:20] like you mentioned, about a year and a month ago. Um, [15:24] and, you know, I actually would have thought this would be different 13 months later, but this is still very much the case where [15:30] "agent" is synonymous with unreliable, [15:34] stochastic, demoware, vaporware. [15:38] And I think something very important for us is [15:41] we want to build these systems that [15:42] aren't just, like, cool examples of what is to come, [15:45] but rather valuable today, [15:47] and not just valuable for, like, a hacker on a side project, but [15:50] valuable to enterprise engineers today. [15:53] Um, [15:54] we felt very strongly that "agents" just doesn't really capture [15:57] what we're trying to deliver. And so, [16:00] fun fact, we were originally incorporated as [16:03] The San Francisco Droid Company. [16:05] But [16:06] upon legal advice, and given, um, I guess, the eagerness with which Lucasfilm pursues its trademarks, we, uh, [16:14] changed our name to Factory. [16:16] Fair enough. So is it fair to say then that a droid is sort of like a job-specific autonomous agent that actually works? Is that a reasonable way to think about it? [16:27] Yeah. Okay. [16:28] Exactly. [16:29] You just said the words cognitive architecture, and I know my partner, Sonya Huang, well enough to know that this is her love language. So I'm sure that Sonya's mind just lit up with a whole bunch of questions for you. So I don't want to get in the way. Sonya, have at it. [16:42] We just had Harrison on the podcast, who talked about custom cognitive architectures as well. [16:46] I guess, what are you doing on that front, and how do your implementations dovetail with the multi-droid strategy that you're taking?

16:53-18:24

[16:53] Yeah, absolutely. I mean, it's a great question. And I think [16:57] the way that we think about [17:00] reasoning and cognition from the standpoint of these systems... [17:07] There are clearly huge innovations happening on [17:12] both layers, the foundation model layer [17:14] as well as on the kind of orchestration or [17:17] kind of application layer. [17:19] The way that [17:21] you can kind of think of our technical approach on this [17:24] is that, [17:25] uh, [17:26] you know, traditionally, labs like DeepMind and kind of some of these orgs that are really focused on solving problems [17:33] that you can model like a game, uh, [17:36] where you have rules and an environment and feedback loops... [17:40] You can build out systems which model the reasoning of humans [17:44] and even outperform them. They did this with the Alpha series of models: [17:48] protein folding, Go, code. And for us, most of the reasoning work we do [17:53] is similarly focused on kind of inference-time [17:57] reasoning, search through kind of decisions, and what we kind of think of as, you know, [18:03] maybe it's something of intuition, maybe it's something of planning. But, uh, [18:09] we aren't training foundation models yet. [18:11] And I think [18:13] a lot of the innovation that's going to happen at the foundation model layer will be things like, [18:18] and context window and kind of performance on some subset of tasks.

18:24-19:55

[18:24] But any time that you need [18:27] action and environmental feedback and kind of [18:30] long-term planning, [18:33] it's going to be really difficult to build a single foundation model that does that. And I think it's really the application layer where those types of innovations are going to happen. [18:43] Yeah. I thought the Princeton SWE-agent paper that came out [18:48] last week or so was really interesting as an example of that, of, like, you can get [18:52] incredible agentic [18:53] reasoning performance on code tasks [18:56] from small open source models. I thought that was a really nice proof point of what you're saying. [19:01] We love the whole team that put that together, and the SWE-bench work, [19:05] I think, is a popular benchmark in the space. I think it's, [19:10] you know, it's [19:11] clear that [19:12] a lot of the effort towards building these systems relies on [19:17] not just, like, any one [19:18] benchmark or eval or set of tasks, but rather collaboration across a bunch of different areas, whether it's the model layer, whether it's the tasks themselves, it's what data are you using to evaluate, [19:30] and ultimately the overall architecture. And yeah, they're a really great team. We're super pumped to see their work. [19:37] Okay, last question on this, and then I will pause myself. [19:44] Any favorite cognitive architectures? Like, is it the [19:47] tree of thoughts, chain of thought, like, [19:50] any favorite cognitive architectures that you think are especially promising or fruitful in your field?

19:55-21:25

[19:55] Yeah, I think that's a great question. I mean, I think, kind of what I alluded to previously, when you have, like, the almost game-like [20:04] problem space where there are [20:06] kind of simulatable, analyzable, and optimizable [20:11] boundaries, [20:13] then that means that you can search through those decisions. And there's a bunch of techniques, like Monte Carlo tree search, language agent tree search, that people have talked about in research papers that I think are interesting approaches here. I think that... [20:29] In my mind, there isn't a singular cognitive architecture that makes sense for [20:35] all tasks. [20:36] And a lot of the benefit of breaking down the software development lifecycle into [20:42] kind of semantically meaningful segments is that developers, [20:47] when they have these workflows that move from one step to the next, [20:51] they've kind of defined the boundaries of the game, so to speak. And so a lot of the work we do is figuring out which cognitive architecture or what design makes sense for a given task. [21:02] You're reminding me of the Rich Sutton [21:04] "Bitter Lesson": search and learning are the two techniques that scale. Yeah, absolutely. And I think you definitely need both. [21:13] And then, you know, you were talking about this a bit, how sort of the reasoning layer on top of the foundation model [21:20] is really the focus for a lot of the fundamental research and a lot of the fundamental work that you guys are doing.
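
[Editor's illustration] Monte Carlo tree search, named in the segment above, can be sketched on a toy game with clear rules and a win/loss signal. This is not Factory's implementation; nothing in the conversation describes how their droids search. The game (a Nim variant where players alternately take 1 to 3 stones and whoever takes the last stone wins), the names `Node`, `uct_child`, `rollout`, `mcts_best_move`, and the exploration constant are all assumptions chosen for illustration.

```python
import math
import random


class Node:
    """One game state: `stones` remain and it is some player's turn to move."""

    def __init__(self, stones, parent=None, move=None):
        self.stones = stones
        self.parent = parent
        self.move = move                      # move that led from parent to here
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0                       # wins for the player who moved INTO this node


def uct_child(node, c=1.4):
    """Pick the child maximizing the UCT score (average reward + exploration bonus)."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(stones, rng):
    """Random playout; returns True if the player to move at the start wins."""
    turn = 0                                  # 0 = player to move at the start
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return turn == 0                  # whoever took the last stone wins
        turn ^= 1
    return False                              # no stones left: previous mover already won


def mcts_best_move(stones, iterations=2000, seed=0):
    """Return the most-visited first move for a pile of `stones` (assumes stones >= 1)."""
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and not terminal.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one untried move as a new child.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new state.
        mover_wins = rollout(node.stones, rng)
        # 4. Backpropagation: flip the reward at each level, since players alternate.
        reward = 0.0 if mover_wins else 1.0
        while node is not None:
            node.visits += 1
            node.wins += reward
            reward = 1.0 - reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move
```

With a pile of three stones, `mcts_best_move(3)` settles on taking all three, since that move wins in every simulation. The four labeled phases are the part that carries over to other "game-like" problems; the rules and reward signal are what would change.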

21:25-22:56

[21:25] Matan, you had a line a couple months ago when we were talking about... [21:29] That was, and hopefully this doesn't come across as snarky, because it's not meant to, [21:34] but it was something to the effect of, [21:36] "There are 800 engineers at OpenAI working on my margins for me." Can you say a word about that? Because I thought that was, first, incredibly well put, [21:45] and then second, pretty good insight in terms of how you're building the business and really benefiting from the work of the foundation models. Can you just say a couple words about that? [21:54] Yeah, absolutely. So, you know, there are a lot of companies, [21:59] a lot of startups that are pursuing training foundational models or fine-tuning models. Then there are a lot of huge research labs, like OpenAI [22:07] and Anthropic, [22:08] who are also putting a ton of resources behind making these foundational models, [22:12] um, [22:13] better, cheaper, faster. [22:15] And [22:17] from our perspective, right, like, [22:18] we don't want to run a race that's not suited [22:23] to our abilities, right? Or we don't want to fight a battle that we know we won't win. [22:27] Training foundational models, we are not going to win that battle. [22:30] And similarly, I also don't think it's a particularly unique battle at this point. [22:34] I think [22:35] these companies were incredibly unique and innovative, clearly, based on what they're delivering. [22:39] But now I think the stage is set in terms of training foundational models. And I think similarly with a lot of the infrastructure for fine-tuning and that sort of thing. [22:48] What has not [22:49] really come to fruition yet is [22:52] actually making products with AI that people are using.

22:56-24:27

[22:56] There's so much talk about all these foundational models, all this infrastructure, and there are still [23:00] very few real products that use this AI. [23:04] You know, in the analogy that, you know, VCs like to talk about a lot, we have a ton of picks and shovels and no one's actually going for gold. [23:12] And so [23:12] the thesis behind how we're building this company is: [23:15] let's first, you know, use these [23:17] beautiful black boxes that OpenAI, Anthropic, and Google are spending billions of dollars and, you know, hundreds of engineers to make. [23:25] Let's use these black boxes [23:26] and build a product that people are actually using. [23:29] And once we do that, then we can earn the right [23:32] to do the fancier things like fine-tuning and training. [23:35] Um, [23:36] if you're unable to build a product that people are actually using with these incredible models, [23:41] then chances are fine-tuning and training will not save you. And it's probably just not a good product. [23:46] And so that's kind of the approach that we're taking there. And so, you know, we do get a lot of improvements when new models come out. But yeah, we are very much grateful for the work that's being done at these cutting-edge research labs. [24:00] You said a lot about how what you're doing is kind of, like, making AI immediately practical for engineers in, like, an enterprise setting. [24:09] And so [24:10] I want to throw out another, I think it's a Matan quote, and I'm not sure if you were quoting somebody else, but [24:14] you were telling us last time, you said, if Jeff Dean shows up at your office and he doesn't understand your code base, [24:20] he won't be productive. [24:21] Unpack that for us. Like, what does it take to kind of make

24:27-26:03

[24:27] a coding agent that's not just good for anybody that boots up a computer, but for somebody that's a full-time engineer at a real software company? [24:34] Yeah, totally. And yeah, so the analogy here is that, you know, [24:38] Jeff Dean is the analog of a really, really good [24:41] foundational model, let's say, like, GPT-6 with incredible reasoning, right? [24:45] But if it comes into your engineering organization, with all your nuances and all your engineering best practices, [24:51] just having that good inference and good reasoning is not enough to actually contribute [24:57] and automate these tasks reliably. [24:59] Some given isolated task, sure, you can solve. Like, [25:02] give it some, like, LeetCode problem, and... [25:05] Give Jeff Dean a LeetCode problem, [25:07] I'm sure he will solve it. [25:08] But if you have some, you know, 20-year-old legacy code base, some part of it is dead code. The other part of it, the last person who was contributing to it just retired, and so no one else knows what's going on there. You need deep understanding of the engineering system, not just the code base, [25:24] but, like, why you made certain decisions, how things are being communicated, what top-of-mind priorities are for the engineering organization. [25:32] And it's kind of these, like, [25:34] less sexy but incredibly important details [25:37] that we're really focused on in order to deliver this to the enterprise. [25:40] What about the, um... [25:42] I think a lot of these AI coding companies are kind of focused on the individual developer's productivity. [25:49] How do you think about the individual-level optimization versus maybe [25:54] the system-wide optimization? [25:56] I think the important thing to think about with respect to the whole org is,

26:03-27:39

[26:03] when a VP of engineering comes into the room, they're not really focused on whether or not [26:10] an individual person [26:11] completed, like, one task [26:13] an hour faster. [26:15] They're concerned about how many tasks are being completed [26:18] and aggregate metrics of speed. [26:20] But if that person completed that task an hour faster, [26:23] but it's 40% worse code, right? It's churning code, where people are gonna rewrite on top of it. [26:29] Or that person took that task and [26:33] they did it in an hour, but it took them four hours to plan that, and they were blocking five other engineers. And so when you start to actually add the nuance of what does it mean to be successful [26:43] measuring an engineering org, you start to bump into a lot of challenges [26:48] with understanding kind of what needs to be improved, and what is a bottleneck, and what is just kind of a secondary metric. I think a lot of the [26:57] initial attempts at making AI coding tools are really focused on first-order effects. How quickly is somebody [27:05] tabbing to autocomplete a statement, or how quickly is somebody completing an individual task? [27:12] But I think that at Factory, a lot of what we're trying to do is understand, [27:16] from an engineering leader's perspective, [27:18] how are you measuring performance? And what are the metrics that you look at to understand, hey, we're doing really well as an org, or hey, we need to make improvements? [27:27] And targeting those. And I think metrics like code churn, [27:30] end-to-end open-to-merge time, time to first answer within the eng org, all of these things are much more impactful

27:39-29:12

[27:39] to an organization's speed of shipping code. And so that's kind of how we think about it. [27:45] I think this really ties into what Eno was just saying quite well, which is, [27:49] you know, [27:50] clearly, we were talking about products earlier as well, like, clearly the AI product that has penetrated the enterprise the most is Copilot. [27:57] Right. [27:58] Um, [27:59] unfortunately, with a tool like Copilot, the metrics that are really held up as success are things like autocomplete acceptance rate. [28:09] And the problem is, exactly to your point, if you're a CTO or a VP of engineering, [28:13] how do you then go to the executive team and say, hey, look, our autocomplete acceptance rate is this high? [28:18] They don't know what that means. They don't understand how that translates into, like, business objectives. Right. [28:23] And also, you know, Eno was alluding to this, [28:25] there's kind of a [28:26] hidden, [28:28] you know, danger to some of these autocomplete tools, which is [28:31] orgs that use tools like this end up increasing their code churn by anywhere from 20 to 40%. [28:37] There are some studies that look into this. There are some problems with these studies, but, you know, directionally what's clear is that [28:43] as the percentage of AI-generated code increases, [28:47] if you're not, you know, doing anything different in your review process, [28:52] code churn is going to go up. [28:54] And so [28:55] our reason for focusing on org-wide metrics [28:58] is that it kind of divides out all of these concerns. [29:01] If we look at things like how fast are you completing your cycles, what is your code churn across the org or across these different repos, that divides out these

29:12-30:42

[29:12] kind of smaller, intermediate metrics... [29:16] and gives you a sense of, hey, we are shipping faster [29:19] and we're churning less code. [29:21] Um, [29:22] so that's really how we talk about this with these engineering leaders. [29:26] At the end of the day, the three... [29:28] kind of main axes we look at [29:30] are saving engineering time, [29:32] increasing speed, [29:34] and improving code quality. [29:36] And ultimately, so these are three, and again, [29:38] there's kind of different... [29:40] uh... [29:41] you know, complexity of metrics for different parts of the org. [29:43] These are the three that we discuss with engineering leaders. [29:46] But we want to arm them with information when they're talking to, let's say, their CFO. [29:50] And so really we kind of, uh, [29:52] break that down into one [29:54] main metric, which is engineering velocity. [29:57] And that's really what all of these droids are targeted towards: [30:00] increasing engineering velocity. [30:04] Let me try to recap a couple... [30:06] parts of the story thus far. [30:09] In some ways, this is a compound lever, meaning AI is a lever on software, software is a lever on the world. And so building an autonomous software development system is one of the most impactful things you can possibly do with your lives, which is pretty cool. [30:24] There are a few unique angles to the approach that you guys have taken, or maybe not unique, but distinctive. [30:29] One of which is the decision to ride on top of the foundation models, which means that you get to benefit from all their ongoing innovation. [30:35] It also frees you up to really focus on the reasoning and the agentic behavior on top of those foundation models,

30:42-32:16

[30:42] which is part of the reason why you can deploy [30:45] your product as a series of droids, which are basically job-specific autonomous agents [30:51] that do something like test or review end to end in a way that is practically useful to an engineering organization. [30:58] Um, and instead of focusing on just producing more code, you're actually focused on kind of the system-wide output, which requires you [31:05] to have really detailed context around [31:08] not just the code base, but all of the systems and processes and [31:12] kind of nuance around the entire environment. [31:15] And having done so, you can increase, you know, velocity for an organization. [31:19] I think that's a bunch of the story that we've talked about so far. [31:23] Let's talk a bit about the results. Are there any good sort of customer examples you can share of, you know, Factory in action and the results that you've been able to have for people? [31:35] Yeah, so I think some of the main things that we're seeing, like, across the board, and we're not super... [31:41] public on case studies just yet, [31:43] uh... but [31:45] something that we see across the board is... [31:47] I think our average cycle time improvement is around 22%. [31:53] Um... [31:53] On average, we are lowering code churn [31:56] by 13%. [31:58] Um... [31:59] tools like, and I guess we haven't even gotten into the specific droids, but [32:03] tools like the test droid [32:05] end up saving engineers like around 40 minutes a day, [32:09] which is pretty exciting. [32:12] And yeah, I think kind of going back to what we were talking about in terms of benchmarks,

32:16-33:48

[32:16] one of the most exciting things about having [32:19] thousands of developers who are actually using these tools [32:22] is that [32:23] we get this live set of... [32:25] of benchmarks, and we get evals and feedback from these developers about how these droids are performing. [32:31] And so, [32:32] you know, [32:33] like Eno mentioned, we are huge fans of SWE-bench and what that's done kind of for the general community in giving people, uh, [32:40] like an open source benchmark to really compare these models. [32:43] But, you know, strategically for us, having this deployed in the real world has allowed us to dramatically increase our iteration speed in terms of quality for these droids. [32:54] What have you guys learned along... since you have a bunch of people using this in the real world, what have you learned along the way? And have there been any big surprises? [33:03] Engineers love ownership. [33:05] Yeah. All right, say more. [33:08] Absolutely. I mean, I think it really is that, you know, when you're building an autonomous product, [33:14] and the goal is to take... [33:17] take over a task, [33:19] you have to deal with developers who are fickle for good reason. They're constantly bombarded with developer tools and automations, and anything that's kind of being... [33:29] enforced from a top-down perspective needs to be [33:33] very flexible. And so making sure that, you know, when we're building these products, we think about what are the different preferences or ideas that people have about how this task should be done, [33:45] and then building as much flexibility into that.

33:48-35:18

[33:48] I think a great example of this is [33:50] the review process. [33:53] Everybody has a different idea of what they want code review to look like. [33:57] Some people want superhuman linters. Some people want, you know, really deep kind of analysis of the code change. [34:06] Some people don't even like code review. They get annoyed by it entirely. Matan has a great quote about [34:12] what code review is like. I don't know [34:16] if you want to share that. Yeah, yeah. So, I mean, in general, we've kind of internally realized that, [34:21] um, [34:22] the code review process is very much like going to the DMV, [34:26] in that no matter how clean the DMV is, no matter how fast the line is, no one loves code review. [34:32] Because at the end of the day, someone's criticizing you. Someone's going in and looking at what you did and telling you better ways you could have done it. [34:39] So in general, the review process, it's the type of thing that, as an engineering leader, it's great to see, like, moving the needle on these organization-wide metrics. [34:48] As a developer, [34:49] it's maybe not the most fun thing. Whereas something like the test droid, right, which is generating tests for you, so you don't spend hours [34:56] writing your unit tests, [34:57] that's incredible as a developer, but, you know, for the engineering leader, it's slightly less obvious how that connects directly to business metrics. [35:06] So I think this is part of why it's important for us to have this fleet of droids. [35:11] Because... [35:12] we're not just building this for the engineering leader, nor are we just building this for the developer, but rather for the engineering organization as a whole.

35:19-36:54

[35:19] Part of what I heard there was that I don't have to go to the DMV anymore. You can just send me my driver's license in the mail. [35:25] Yeah, basically. Love it. It's a good way to sum it up. [35:28] Have you guys seen Pat drive? I don't think they should be sending him a driver's license. [35:31] We have Waymo for that. [35:35] Speaking of Waymo, uh... [35:39] how far out do you think we are from having fully autonomous [35:43] software engineers? Like, if you talk about Waymo, it felt like it was going to come really fast, and then it felt like we went through a valley of... [35:49] and now the future is coming out of this super fast again. [35:52] Which inning are we in for the kind of fully autonomous software engineer team [35:58] cycle? And when do you think we'll have fully autonomous Jeff Deans? [36:03] This is a great question, and I think one that we get a lot. I think one thing that's worth doing is kind of reframing what a fully autonomous software engineer will do. [36:14] There have been many moments where technical progress has led to [36:19] kind of, you know, [36:21] labor dynamic changes and increases in the level of abstraction at which people work. [36:27] And I think that historically... [36:29] enabling people to operate on or impact the construction of software, [36:34] you know, at a higher level of abstraction with less domain knowledge, [36:38] has generally led to [36:40] huge increases in demand for software. [36:43] And I think that what we're seeing with the customers we're working with today [36:49] is that when you free people up from these types of kind of secondary tasks, like

36:54-38:33

[36:54] generating unit tests that map to a pull request or writing and maintaining documentation on a code base that [37:02] 95% of people know, but that documentation comes into play for that 5% that doesn't. [37:07] they start to shift their attention to, [37:10] higher level thinking. They think about orchestration, they think about architecture, they think about, you know, what is this PR actually trying to do, and less about [37:20] did they follow the style guide? [37:22] I think that what we're seeing is that this is happening today already because of AI tools. And over time, as they get better and better, [37:31] we'll see that shift towards, [37:33] uh, [37:33] software engineers becoming a little bit more like [37:37] architects or designers of software. And so in the future, I think there's going to be 10 times more people involved in software creation, where every individual has the impact of maybe 100 or 1000 people [37:49] It just may not look exactly like [37:52] the individual steps of the development lifecycle that we see today. [37:57] You know, that reminded me of a quote that you guys have on your website, which said, and I'm going to read this. [38:03] It says... [38:04] We hope to be a beacon of the coming age of creativity and freedom that on-demand intelligence will unlock. [38:11] And that really resonated with me when I read it because it sort of implies a very positive and optimistic view of the world that we're heading into. [38:20] I wonder if you guys want to say a couple more words on that or... [38:24] or sort of what you think the sort of relationship between man and machine will be in the fullness of time. This kind of goes back to our original approach, which is,

38:33-40:03

[38:33] You know, [38:33] it's very tempting to go after the sexiest parts of software development, in particular, you know, like [38:39] building an app from scratch, right? [38:41] But that's also the sort of thing that will make a developer defensive because... [38:46] that's the part that they enjoy, right? [38:48] And so in a world where you automate the development, [38:51] then an engineer is just left reviewing, testing, and documenting, [38:54] which is like a depressing hellscape if you were to ask any... [38:57] any software engineer, right? So for us, it's very important that we position ourselves [39:03] aligned with the developer instead of, you know, going into these organizations and being antagonistic with them, right? Like, [39:09] by going in and automating the things that we don't want to do, or rather, [39:13] by going in there and automating the things that developers don't want to do, [39:17] we are positioning ourselves with them, right? [39:20] Five years from now, [39:21] I don't think anyone really knows what software engineering will be or even if it'll be called that anymore. You know, to Eno's point, it might be, you know, you're like a software curator or cultivator or orchestrator. [39:33] But by positioning ourselves this way with the developer, wherever that role goes, we will be there side by side [39:41] to allow them to have this higher leverage. [39:44] And so, yeah, completely agree, to your point, like, [39:46] this is one of the most incredible... [39:49] things that is going to happen to, you know, our ability as humans to create. [39:53] And I think for us, it's just incredibly important that [39:56] we are aligned... [39:58] with the users of this product and not, [40:00] you know, antagonistic, trying to replace them.

40:04-41:42

[40:04] How far do you think we are from having these reliable... [40:07] kind of, maybe call it, intern-level engineers? [40:10] Is it a year out? Is it really here today? Is it a decade out? [40:15] I think... I mean, it depends on the... [40:18] the task. For things like code review and testing, I think we're here. We're already there, where we're able to operate at a level that, [40:28] you know, for many... like, there's feedback from one organization that we got in particular, [40:33] where we brought them the review droid, and this was pretty early on, [40:37] and they said, you know, [40:39] the review droid is the best reviewer on our team. [40:42] And I think that every once in a while, you kind of hear something like that, and it gives you a lot of kind of confidence that directionally, we're definitely moving towards something [40:51] that is valuable. And for tasks like, [40:55] you know, [40:56] hey, we've got to [40:57] decompose our monorepo into a ton of microservices, [41:02] you know, the type of thing that you might hand to, like, a staff-level engineer armed with a team of engineers under them, [41:10] I think that... [41:11] we won't see like a binary moment of, oh, well, now this is done by an AI. I think that their responsibilities will slowly start to get decomposed [41:20] into the tasks of [41:22] planning and [41:23] implementing the refactor, going one file at a time. And when they start handing off those subtasks to AI, I think that... [41:33] role will kind of start to be called something different. Because when you're no longer as focused on what is the individual line of code that I'm writing tomorrow, and more focused on

41:42-43:15

[41:42] what is our mission or what is our goal as an engineering team, you really are more of an architect and less of an implementer. [41:50] And a concrete example of, you know, [41:53] us eating the food that we're creating, right? [41:56] We were dreading for months creating a GitLab integration. Some of our customers use GitLab. [42:02] We want to build cool AI stuff. We didn't want to spend time building a GitLab integration. [42:06] We had our code droid [42:08] fully spec out [42:09] what the steps of building a GitLab integration would look like. [42:12] And then it actually implemented every one of these sub-tickets. We were, of course, monitoring it just to make sure it wasn't breaking anything. [42:19] And [42:20] we now have a GitLab integration. [42:22] And so this is something that we genuinely were considering getting an intern to do, because we just... [42:28] we really didn't want to do a GitLab integration. [42:32] Shout out, GitLab. [42:34] Yeah. But like, materially, the droids saved us like hours of time. None of us had built a GitLab integration before. [42:42] And also it's just relatively complicated to abstract away [42:46] the source code manager, and so [42:50] that was materially intern work that we did today. [42:53] So to answer your question, it is now. [42:57] It's just kind of slowly climbing up more and more the level of complexity [43:01] of these tasks. [43:03] Future's really here. [43:04] It is. That... [43:05] I have a question about competition, and not specifically the competition in your space, but... [43:10] I think how you more generally think about navigating competition. I think you guys are the type of founders that...

43:15-44:52

[43:15] a lot of companies in the application layer really look up to, [43:19] because you're insanely ambitious, [43:21] building a real company of meaning. [43:24] You're making a lot of smart decisions, like, you know, riding on other people's models. [43:30] I think the obvious kind of, like, [43:33] scary thing, scary kind of other side of that is, you know, [43:37] every other competitor in the space has access to the same models as you. [43:41] And so I'm curious how, maybe just mentally and then I guess overall, you think about approaching [43:47] competition in this space. Do you think it's more elevated in this, [43:50] in kind of the application layer AI market, than in other startup markets historically? And how do you think about navigating that? [43:57] Totally. Yeah, I think that's a great question. And I think that's... [44:00] Really, you know, our approach to that has defined how we've built out this team. [44:04] And really, I think there are a lot of [44:06] ways you can respond to competition and, like, mentally... [44:10] kind of [44:11] justify your existence versus competitors. [44:14] I think for us, [44:16] on the team side, [44:18] we are just a team of people who are more obsessed... [44:22] than anyone else out there. And I think that is something that just has compounding benefit. [44:28] I am willing to bet everything that the people that we have assembled [44:32] are just more obsessed than everyone else working in this space. [44:36] I think [44:37] kind of a corollary to that is: [44:39] the only way you can win is by executing faster. [44:43] There is... [44:44] Everything else is all just like sprinkles on top. The only way you can really win is by executing faster and being more obsessed.

44:52-46:22

[44:52] And [44:53] that is what our team is. And I think, [44:56] I guess one last thing is [44:58] having a group of people who [45:00] respond to kind of external pressures [45:04] as, like, more motivating, [45:06] um... [45:06] and, [45:08] you know, [45:09] responding in that way, [45:10] also being very mission-driven, right? [45:13] If, you know, a competitor does something big and then suddenly you're deflated, well, [45:17] if you're truly obsessed with a mission, it's irrelevant. [45:19] Right. [45:20] If you're truly obsessed with our goal of bringing autonomy to software engineering, [45:24] all of that is noise. What we need to do is execute as fast as possible [45:28] in this direction that we've set, and [45:31] the rest will sort itself out. [45:33] Love it. Really well said. [45:34] Maybe a few final questions to close us out. [45:38] If you weren't solving the kind of autonomous software engineering problem, [45:42] what problem would you be solving? [45:44] I guess I have to be banned from coding agents for this. [45:48] Perhaps robotics. I find robotics very interesting. I think a lot of the time the team here... [45:53] A lot of the team comes from backgrounds working on autonomy and robotics. And we talk about how what we're building really kind of resembles that in many ways. [46:02] I think multimodal function-calling LLMs are here, and the robotics companies' decreased hardware costs that are coming out are clearly making progress, so it feels like a fun area. [46:14] So you'll be making physical droids. [46:16] Exactly. It's on the roadmap. [46:19] We'll let you go, Matan. [46:21] Yeah, um...

46:23-47:55

[46:23] I think this is one of my blind spots, where I just suffer from severe tunnel vision. [46:29] I genuinely cannot fathom working on anything else. [46:33] I'm just genuinely obsessed with our mission to bring autonomy to software engineering. If I wasn't working on this, I'd figure out a way to work on this. [46:40] I know that's a cop-out answer, but I genuinely, I can't, it does not compute, so. [46:45] That is, in fact, a cop-out answer, but it is a fantastic cop-out answer, so we will take it. [46:50] One of the questions that I always like to ask is, who do you admire most in the world of AI? [46:55] And tell you what, Matan, because of your background, we'll let you... [46:58] we'll let you look at the superset of AI and physics, if you like. [47:02] Ha ha ha. [47:03] I would say a name that comes to mind when you say that is Jeff Dean. [47:08] I think we mentioned him earlier already, actually. [47:12] His impact in research is one huge side of that. I think TensorFlow and... [47:17] kind of the work that that whole team has done at DeepMind and related. But [47:23] I've also heard he's a nice guy, and I think that the thing is, having responsible... [47:29] I think leadership in the AI community, I think, is really important. And there's a lot of folks who are on Twitter all the time, [47:38] you know, clashing. And I think that seeing folks who are outside of that side of it, I think, is pretty great. [47:46] Yeah, and I think... [47:47] not to give you guys a double cop-out, but [47:50] at Factory, we very highly emphasize collaboration, and I think, like,

47:55-49:32

[47:55] in AI in particular, [47:58] everything has been done by groups of people. [48:00] And so it's hard to really think about one individual. I think in physics, there are a lot more, like, solo geniuses doing something crazy. [48:07] Um, [48:08] but I think a team recently that I think we really admire at Factory is Mistral and how, [48:14] kind of quickly, they basically... [48:16] came into open source and brought those models [48:20] to basically the cutting edge in a super short amount of time. [48:23] And I think... [48:24] yeah, I speak not just for myself, but I think all of our team really admires both the mission that they have and the speed with which they executed on that. [48:32] So. [48:33] Yeah, I would say Mistral. [48:35] Awesome. [48:36] All right, last question. If you had to offer one piece of advice [48:40] to founders or would-be founders hoping to build in AI, [48:44] what piece of advice would you offer them? [48:46] We are in a land of picks and shovels. [48:49] And no one has struck gold yet, clearly. [48:54] I would say go for gold. [48:56] I would say, [48:57] try to build something that you think is going to get 10x better if OpenAI releases GPT-6 or 7. [49:05] Internally, we think of our product as something that will multiply in value and uniqueness when new models are released. [49:12] And I think for us, it's always like... we were listening to the OpenAI announcement yesterday, [49:18] right, and... [49:20] everyone is excited. Everyone's pumped when a new model comes out, when open source does something great. If you're stressing about new product releases or demos, it might mean it's worth adjusting your product strategy.

49:32-51:05

[49:32] Beep beep beep beep beep beep. [49:35] Congratulations on launching Factory and beating state-of-the-art on SWE-bench by such a wide margin last week. It's incredible. [49:41] Just for our audience, can you maybe quickly recap what SWE-bench is? [49:45] Yeah, absolutely. And thank you. All credit goes to the [49:48] Factory team for making it happen. [49:50] Um... [49:51] SWE-bench is a benchmark designed to test an AI system's ability to solve [49:57] real-world software engineering tasks. [50:00] So it's around 2,300 issues which were taken from [50:04] contributions made to 12 popular open source Python projects. [50:09] And typically these issues are bug reports or unexpected behavior that people reported on these open source projects. [50:18] And the idea is, [50:20] you know, all of these real-world issues were addressed by other humans. And so you have a ground truth of what [50:28] a human software engineer would do when faced with an issue. And the benchmark is trying to test: [50:35] can your AI system go through each of these issues [50:38] and generate a code change that properly addresses it, [50:42] comparing it to the human solution with tests that a human wrote? And so there's a lot of asterisks, [50:51] but it is a somewhat useful approximation of your system's ability to take natural language [50:56] and then turn that into code. [50:58] And I think the previous high watermark on SWE-bench was 14% or so from Cognition's Devin, until last week.
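
The scoring described here can be sketched roughly in Python. This is only an illustration, assuming an issue counts as solved when the generated code change passes every human-written test for that issue. The `IssueResult` type, its field names, and the sample numbers are hypothetical and are not the actual SWE-bench harness.

```python
from dataclasses import dataclass


@dataclass
class IssueResult:
    """Outcome of running the human-written tests against one generated patch."""
    issue_id: str      # hypothetical identifier for the issue
    tests_passed: int  # human-written tests that pass after the patch
    tests_total: int   # all human-written tests for this issue


def is_resolved(result: IssueResult) -> bool:
    # Assumption: an issue is solved only if every test passes.
    return result.tests_total > 0 and result.tests_passed == result.tests_total


def resolve_rate(results: list[IssueResult]) -> float:
    # Fraction of issues whose patch passes all of its tests.
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up run: 100 issues, 19 fully passing, 81 failing one test each.
sample = [IssueResult(f"issue-{i}", 4, 4) for i in range(19)]
sample += [IssueResult(f"issue-{i}", 3, 4) for i in range(19, 100)]
rate = resolve_rate(sample)
print(f"{rate:.0%}")
```

Note that a patch passing three of four tests earns no partial credit in this sketch, which is one reason a headline percentage on a benchmark like this can look low.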

51:05-52:39

[51:05] And you put up a really impressive new result at 19%, which is such a wide margin. [51:11] This is such a competitive field right now and such a competitive benchmark that everyone is trying to beat, which makes... [51:17] you know, your results even more impressive. Could you maybe share a little bit about your approach and how you got there? [51:23] Definitely. [51:24] And one of the main reasons we were interested in SWE-bench is that there's a lot of companies and research labs that made submissions. You can see Microsoft Research, Amazon, IBM, ByteDance. [51:34] And [51:36] I think that's a testament to the SWE-bench team's effort in making this benchmark a household name, which is great. I think one reason we were able to outcompete the kind of well-funded tech giants and other AI codegen startups is that we're honestly not building the code droid for a benchmark, but rather to support real-world customers. [51:57] And we've always said customers are the best benchmark. [52:01] And I think there's some great evidence for the success of that approach. There's a few areas our technical report goes into around... [52:08] planning and task decomposition, environmental grounding, codebase understanding. But overall, I think that [52:15] the thing that matters most when your team's working on these types of general software problems [52:21] is kind of like, what is the North Star? What are you iterating against? [52:24] And so having kind of a real-world data set can make a huge difference. [52:30] And we just had Harrison on the podcast last week, actually, talking about cognitive architectures. [52:35] To what extent did prompt engineering and cognitive architectures play a role here in your results?

52:39-54:10

[52:39] I would characterize our research as continuously pushing the question of, [52:44] how can we model each droid's architecture to more closely resemble the human cognitive process [52:50] that takes place during the task? [52:52] It's funny, we actually have internally been referring to the flow of data and LLM calls as the droid architecture, [52:59] basically since the first droid. [53:01] And when Harrison first wrote about cognitive architectures, [53:04] it really became apparent that [53:06] that concept, cognitive architecture, is a great mental model [53:10] for how to characterize the systems... [53:13] that have complex LLM interactions and data flow. And so... [53:18] for us, I think the meta problem of designing a good cognitive architecture [53:23] is [53:24] balancing [53:25] flexibility with rigidity in the actual workflow. [53:30] You want very rigid entry points, and certain common trajectories like error recovery need to be really consistent, [53:39] but then you want the flexibility and the dynamics during the majority of the problem-solving process. [53:45] And so it's a challenging balance, but I think it's one of the most interesting [53:51] problems when building the droids: how do you know [53:54] kind of when to add structure and when to let that, uh, [53:58] let the droid, so to speak, handle it? [54:01] Really cool. So every droid has its own cognitive architecture that [54:04] mirrors as closely as possible what the kind of human equivalents of that task would be doing. [54:09] Yeah, exactly.
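
One way to picture the balance described here is a workflow with a fixed entry point and a fixed error-recovery policy wrapped around a freely chosen sequence of steps. The Python below is a minimal sketch under that assumption, not Factory's droid architecture. `run_task`, `Step`, the `plan` callback, and the retry count are all hypothetical names and choices made for illustration.

```python
from typing import Callable, Iterable

Step = Callable[[dict], dict]  # hypothetical: one unit of work on shared state


def run_task(task: dict,
             plan: Callable[[dict], Iterable[Step]],
             max_retries: int = 2) -> dict:
    # Rigid entry point: every task is validated the same way.
    if not task.get("description"):
        raise ValueError("task needs a description")
    state = {"task": task, "log": [], "status": "running"}

    # Flexible middle: the planner decides which steps run, and in what order.
    for step in plan(state):
        name = getattr(step, "__name__", "step")
        for attempt in range(1, max_retries + 2):
            try:
                state = step(state)
                break
            except Exception as exc:
                # Rigid error recovery: same logging and retry policy for every step.
                state["log"].append(f"{name} failed on attempt {attempt}: {exc}")
        else:
            # Retries exhausted: stop and hand control back to a person.
            state["status"] = "needs_human"
            return state

    state["status"] = "done"
    return state


# Usage: a step that fails once, then succeeds on the retry.
calls = {"n": 0}


def flaky_edit(state: dict) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    state["edited"] = True
    return state


result = run_task({"description": "rename a function"}, lambda s: [flaky_edit])
print(result["status"], len(result["log"]))
```

In a real system the `plan` callback would presumably be driven by a model rather than a fixed list. The point of the sketch is only that validation and recovery stay identical no matter what the planner chooses.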

54:11-55:56

[54:11] 19% [54:13] is amazing compared to prior state of the art. [54:15] It also still feels quite far away from, you know, [54:19] reliable code droids that people will just trust to [54:22] run wild in their code base. [54:24] What do you think is the threshold at which engineers will actually start to use these code droids reliably and just... [54:30] you know, let them run? [54:31] Are we there yet? Or what is the threshold? [54:34] Yeah, for sure. And I think that one thing to keep in mind is that the... [54:40] percentage on a benchmark like SWE-bench is kind of like one of many possible measures, because the answer, I think, really is that they are already using it in production. [54:51] But the use cases that might kind of highlight what the code droid is designed for may not necessarily have a ton of overlap with what is tested in a given benchmark. [55:03] So if you take like HumanEval or some of the other coding benchmarks, [55:07] that maybe tests your ability to pass a coding interview, but it doesn't really test, like, your real-world software engineering. SWE-bench, I think, actually does test a lot of real-world software engineering, [55:18] but in the particular context of debugging or kind of, [55:22] you know, unexpected behavior identification. There are some feature requests, and there are [55:27] a lot of kind of not explicitly debugging-style problems. [55:32] But... [55:32] tasks like a migration, a refactor, a modernization that take place over multiple changes, that oftentimes have humans very heavily collaborating, are really a pretty different problem. And our internal evaluations are much more focused on those customer tasks. And so we have way higher reliability rates for those styles of tasks.

55:57-57:34

[55:57] And I also think that a huge part [55:59] of the role of human-AI interaction design [56:04] is acknowledging where the systems are currently falling short [56:08] and building into your interaction pattern [56:10] accommodation for the weak points of the AI system. This isn't going to, 100 percent of the time, perfectly capture the intent of what you were doing. [56:20] So how do you kind of have [56:22] failure trajectory handling, how do you introduce the ability to kind of edit midway as the code droid is working, to observe and have kind of some interpretability into why a code droid is making a decision, so that when it does something, the human being actually can... [56:38] step in or at least understand what went wrong? And so I think that those allow you to say, well, we may not be at 100% on something like SWE-bench, [56:48] but we can still use this and get kind of real productive gains in the meantime. [56:54] Totally makes sense. [56:55] And I hear you that SWE-bench is not the be-all end-all, but since you have a good crystal ball into this space... [57:02] do you have a prediction at what point we'll get to 80 or 90 percent on SWE-bench? [57:06] I think that the... [57:08] the pace right now is really, really fast. Um, [57:12] there's a kind of like interesting question of, [57:16] will we get to 80% to 90% on SWE-bench, or will there be a better benchmark that kind of comes out [57:23] before [57:24] we can really meaningfully start hill climbing past like the 50, 60%? There's honestly a lot of tasks in SWE-bench which are...

57:34-59:09

[57:34] I wouldn't say impossible, but it almost feels like getting them right would almost only indicate that you're cheating. It's like they test for really, really specific claims or a string match. And so I think that before we see 80 to 90 percent on SWE-bench, what we'll actually see is kind of like SWE-bench 2 and SWE-bench 3 that focus on... [57:58] trying to think deeply about how we can evaluate when, uh, a, [58:02] you know, a piece of code is correct, but also... [58:07] kind of ideal or useful for a given code base. The SWE-bench folks actually have a lot of really great thoughts about [58:15] how to make these benchmarks better. But I think probably in the next two, three years, we'll see that. [58:21] Yeah, and they're Princeton guys as well, right? [58:22] Yeah, yeah, they are. We actually shared a thesis advisor. [58:28] No way. That's very cool. Well, Eno, Matan, thank you so much for the conversation. Congratulations again on these results and on launching Factory. We are so excited. [58:38] Thank you. Thank you very much. [58:39] *music*
