Empirical Work in the Age of AI
A Stanford IRiSS panel on how AI is changing empirical social science, with a reading map for economists who want the practical research-workflow lessons.
Why Stanford convened the panel
The organizers frame the event around empirical social science in a moment when AI tools are becoming relevant to graduate students, faculty, and research teams at the same time.
Welcome and Origins
Guido Imbens: Let's get started. Welcome, everybody. I'm Guido Imbens from the GSB and the Economics Department, and on behalf of the organizers, welcome to this event on Empirical Work in the Age of AI.
We have a very packed schedule, and I only have 5 minutes for this introduction, given the fabulous lineup of speakers. But there's a couple of things I want to say.
One is, this got started when I had coffee with Rose Tan, and she suggested she give a talk to the economics graduate students on how to use AI tools. Rose had done her PhD here and had been working at some of the tech companies. She was visiting here at Stanford for a couple of months.
She felt the graduate students, and probably some of the faculty as well, were not taking this tool seriously enough. And so she thought it would actually be good to organize an event to get the students and everybody else more focused on these things.
So thanks to Rose for suggesting this. It got a little out of control in terms of the amount of interest, but thanks to Mike Tomz and Chris Fraga from IRiSS, and also support from Stanford Impact Labs, the economics department, and the GSB, here we are. All the potential speakers we approached were very enthusiastic about participating, and so we ended up with a great set of speakers and topics.
Logistics and Questions
In terms of logistics, given the schedule, we don't actually have a break, partly because there are so many people here that we worried we would lose a huge amount of time. So when people come and go, please do so quietly.
There is an hour scheduled for Q&A, and you can submit questions on PollEverywhere using the QR code here on this slide.
Framing the Questions
Now, on to the substance. Where is this all going? I don't know, and I hope to learn more, but clearly this is not going to get settled today. This will be an ongoing process, but one that we should all be paying close attention to.
Will the quality of research get better? I do think so, and we will see some compelling examples of that today, and some of the tools that will get us there. Will the quantity of research increase? Possibly, and that may actually strain the current publication process and require innovations to the process of reviewing and screening papers.
Will the type of research and how we do research be different? Yes, but exactly how, I think, is very hard to predict, and hopefully we'll learn some more about that today.
So let me stop here and make sure we have time for the speakers. Mike will introduce the first speaker. Thanks.
Speaker Introduction
Michael Tomz: Hi, everyone. Welcome. My name is Mike Tomz. I'm a professor in the Political Science Department and the faculty director of IRiSS. It's been a pleasure to co-organize this with Guido and others, and I really want to thank the presenters for being here today to speak with you.
Let me introduce Rose Tan, our first speaker. Rose received her PhD in economics from Stanford. She previously worked as a data scientist at the Federal Reserve Bank of New York, at Quora, Facebook, and LinkedIn. This past year, she was a visiting scholar at Stanford in Stanford Data Science, and she just started a new job at Snowflake.
Rose, I'll turn it over to you. You'll have 25 minutes for a presentation. And we're going to save the Q&A for the last phase of our session today.
Live agentic workflow demo
Rose demonstrates how a researcher can start from a messy project, use coding agents inside a local folder, create reusable instructions, and quickly prototype analysis or review interfaces.
Opening the Demo
Rose Tan: Alright, so I have been asked to give a live demo. Live demos are known for things going wrong, and fortunately today, when things go wrong, it's actually a great opportunity to show you what Claude can do.
First of all, I'm curious how many people in this room already have Claude Code installed? Raise your hand. If we get about 90% plus, I will skip the first part. Perfect. Okay, I think we are safe to skip it.
I just want to flag one piece of this, which is: here's Claude, installed. It took some time, so I did this in advance. But, you know, it lives in the terminal, there are a lot of bash commands, and you may not know that "cd Users/Rose/Documents/whatever" is what you need to do if you've never been in a terminal environment before.
And one really easy trick is just, okay, well, if I don't know what I'm supposed to do, I can ask Claude to figure this out for me.
And all I really need is Claude Code opened up, and some LLM, ChatGPT, Claude, opened up. And I can just use one to get me the other. And so, for example, here, I have a screenshot of, like, okay, so here is my folder that I want Claude to open up, and I just drop this into ChatGPT, and I say, "how do I open folder in Terminal?"
Because keep in mind, here, you haven't opened Claude yet. So first you have to get to the folder, then you open Claude within that folder. And ChatGPT will just tell you.
So, a couple things I want to demo here is just, one, it's very multimodal, like, oftentimes, a picture is worth a thousand words, and it's easier just to send the LLM the picture, rather than trying to describe it verbally.
The second point I want to illustrate here is no question is too small for the LLM. Like, this was a very simple question, I could have Googled it. And if there's one thing I want you to take away from my part of the talk today, it is: "Ask the LLM." Like, it is really just this one very simple thing, okay?
So we're gonna try this interactive exercise in which every time I make this motion, you will say, "Ask the LLM." Okay, so we're gonna try this. How did I figure out what Terminal command to use to open the right folder for Claude? Great. And so, this is really the only thing I want you to take away from my part today.
A lot of the things that I will be showing you are actually anti-patterns of how to use this. (I'm switching screens now, which is why this is happening; bear with me while I switch.) Generally speaking, just opening up Claude and saying, "I guess I want to do some research. Can you help me do some research?"
Like, it's not a good idea. So I'll do my best to flag what is a pattern and what is an anti-pattern, in terms of something to avoid.
A Framework for Day-to-Day Usage
Rose Tan: So we'll start with this quick framework for day-to-day usage. This is a model. All models are wrong, some are useful, I hope this will be useful for us.
Level 1 is just talking with an LLM, like ChatGPT, in a web interface. Level 2 is having Claude Code, Codex, or Cursor installed, and using that. And Level 3 is using the specialized features, like skills, MCP servers, and Claude.md or Agents.md files.
And then the fourth is adding in some lightweight engineering best practices to help your code base be scalable, be reproducible. And then the fifth one is having these, like, really complicated multiagent environments that are doing long-running tasks on their own.
And as you move up this hierarchy, there's a lot more tooling complexity. The first one, all you need is really a web browser.
And then the second, you need to install Claude Code. Money is another one as you move up, and you do need to pay $20 a month, or else you don't get access. Number 3 is, you know, getting some basic software engineering practices in, like GitHub, version control, and things like that.
And then number 4, to have, like, these multiagents who are really running autonomously, you need at least a Claude Max plan, and with the recent changes, you may need even more than that.
And so, for the purposes of this talk, we already went through installing Claude Code, and then I want to talk about levels 3 and 4 here. I think anyone who can get to level 4 with enough time can get to level 5 on their own.
Starting a Replication Project
Okay, so this is Cursor. For my purposes here, I'm just using Cursor as a file viewer. This is my Claude Code. This is a brand new folder. There is nothing in this folder right now. You can tell that there is nothing because there is nothing in the Cursor window. And now I've opened Claude in this folder.
Now, the first time you open Claude, you may be like, well, I don't know what to say. And so, you can just start with something really simple, and say, like, "Hey, I am a PhD student-" Oh no, there's no internet connection. So we'll let it connect.
While it's connecting, I'll let you know that, like, a lot of people have found it to be more efficient to talk to them verbally than to type. And so, there are many tools out there that let you do this, and so I just hit a button, it lets me talk to it.
"I'm a PhD student at Stanford in economics. Show me some things that you can do." And so, you know, you can fill it in with your own department, maybe you are a faculty, a researcher, a graduate student in the political science department, whatever it may be. And it will just tell you what it can do.
And this is really nice, because if you're starting with it, you may not know what it can do.
For the purposes of our demonstration today, what I'm going to ask it to do is replicate the Lalonde paper, which, if you're an econometrics student, you've probably heard of before. I will go ahead and get it started, and then as it's going, we can talk more about it.
This is great. "I am an econometrics student, I would like to do a simple replication of the Lalonde paper. Please make a plan for how to do this. Do not write any code. Ask me any questions you have."
So when we're talking about patterns and anti-patterns, I think the first prompt I gave it, which is, "hey, tell me some stuff you can do," that's good for exploratory work, but it's not really how you would want to be using it on a day-to-day basis.
On a day-to-day basis, whenever you're starting a new project, a really good pattern is, "Hey, Claude, here's some stuff I want to do. Can you make a plan to do it? Don't write any code, ask me questions."
Now, in some ideal world, you would go write an extremely long and detailed and thorough and rigorous prompt. However, that's a lot of work, so we should...? Yes, we should ask the LLM to write the prompt, because it will be far more diligent and detailed than I would ever want to be, you know?
And so here it says, okay, great, like, here is all this stuff, right? Here's a proposed plan. I do recommend you read the plan. This is where, like, an ounce of prevention is worth a pound of cure. I'm gonna skim through it right now, an anti-pattern, right? But I'm just gonna go with, "Yeah, fine, this all looks good."
And so it has a few questions, and so I need to make a couple of decisions. For language, let's use Python. For example, let's do the famous one. For comparison group, just do something simple. Output format, markdown is fine. And then for depth, let's keep everything simple, and let's try to do this quickly.
So, typically, what you would want to do is not what I just said. You want something that is rigorous, replicable, and something that is, you know, you're doing research here, right?
One big bottleneck when doing these live demos is speed. And so, for example, I tried replicating another paper earlier this morning, it did a great job, but it was so slow. It took forever for it to go find the data, even though the data was available online.
And now it has even more. I'm just gonna go with the default suggestions. This all sounds good, just make sure that whatever you're doing is quick. If something is slowing down, feel free to stop and ask me questions. And so, like, clearly you can tell I'm just asking it to focus on speed and go, go, go. That is not how you would want to do it.
But a couple of important things here is just that you see this collaboration between me and the agent, where it's like, okay, you told me to go do this, I have these questions. For those in the room who have worked with RAs, it can feel a lot like that, right?
Where you're like, go do X, and they're like, here's all my questions. And so it does feel like that type of dynamic a lot of times.
One note here is, a lot of times, it's going to be trying to use commands that you may not be familiar with. An example one here is curl, and sometimes you may wonder, like, oh my gosh, is this command just going to erase everything on my laptop? Because that would not be good.
And so security is always an issue when dealing with these. And so when you have a command, and it wants to do this thing, and you're not sure if it's safe or not, what should you do? Exactly. So you can either ask in here, you can pull up another one, and that way, you know that what you're doing is safe. There are safeguards built in. They have, like, prompt injection protection and things like that, but it's never going to be foolproof.
And over time, it's also a good way to develop your own computer science skills. I think that they, you know, even though LLMs can do everything, having some of that mental model in your head can be really valuable.
So as you can tell, I'm basically just hitting enter. This is probably an anti-pattern, you should probably be reading some of this stuff. I, you know, like, I have done this before this morning, I know it can do it if it just keeps trying, so I'm just like, enter, enter, enter, go, go, go. Let's see.
Reviewing the Generated Outputs
Rose Tan: So, since it's going here, I want to show you what it has created. Before, this was an empty folder. It may be more obvious if I just show the regular view in Finder.
So this used to be an empty folder. Claude has found the information online and downloaded it onto my laptop. For those of you who are in economics, it's not surprising it was stored in a Stata file. It didn't have any trouble reading in the Stata file and converting everything it needed into Python.
And so then it created this Markdown file as the output, which is what I asked it to do in the beginning.
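For a sense of what such a replication script boils down to, here is a minimal sketch in Python. The data URL and column names follow the commonly used Dehejia-Wahba NSW extract and are assumptions, not the code Claude actually generated:

```python
# Minimal sketch of a Lalonde-style replication script (illustrative; not the
# file Claude generated). Assumes the Dehejia-Wahba NSW extract, commonly
# mirrored as a Stata file with these column names.
import pandas as pd
import statsmodels.formula.api as smf

URL = "http://www.nber.org/~rdehejia/data/nsw_dw.dta"  # assumed mirror

df = pd.read_stata(URL)  # pandas reads Stata files directly, even from a URL

# Experimental benchmark: difference in mean 1978 earnings, treated vs. control
diff = (df.loc[df["treat"] == 1, "re78"].mean()
        - df.loc[df["treat"] == 0, "re78"].mean())
print(f"Difference in means: {diff:,.0f}")

# Simple OLS with pre-treatment covariates as a sanity check
fit = smf.ols("re78 ~ treat + age + education + black + hispanic + re74 + re75",
              data=df).fit()
print(fit.summary().tables[1])
```

The difference in means here is the famous experimental benchmark from the paper; the OLS line is the kind of simple comparison the generated write-up reports.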
Yeah, I don't think the Zoom is gonna work that well, but I think this Zoom works decently well. Can people see this? Okay.
So here, for example, is the write-up. This is the agent panel and Cursor on the right. I'm gonna close it, because we're just using Cursor as a file viewer today.
And so this is a Markdown file. If you do Command-Shift-V, it turns into something that's easier to read. And if you're an econ student in particular, you're like, I don't deal with Markdown files, I really want things in TeX, I want it to be a Beamer presentation, I want things that are effectively socially acceptable in my field. And so, how do I transform this?
Ask the LLM, right. Transforming things from one format to another is something that LLMs are exceptionally good at. This could be code from R to Python to Stata, or vice versa; you can grab an R package and have it write a Python package out of it. It could be different types of content.
No, this is good. Let's stop here. Can you create a LaTeX file? I want both a PDF and a Beamer template.
And it can take whatever piece of information and very, very quickly and very accurately transform it to whatever other format that you want. And so here, moving back into this folder, you'll be able to see it writing new files. So, here it had some summary statistics and a CSV file, JSON file with the results.
This is the Python file, where it wrote a ton of code, well, not a ton, 96 lines, that's not bad. And then these are the original...
And so here, it's now creating the TeX files, so for those of you familiar, you know, you've probably seen this type of thing before.
Skills and Claude.md
And then I want to finish up by showing you a few of the components. I think that pretty much summarizes level 3, and I want to show a few components from level 4, which are skills and the Claude.md file. I probably won't have time to go through MCP servers, but I'll do skills and the Claude.md file.
Okay, so, first question. What is a skill? Yes. So just ask the LLM to make the skill for you, okay?
So, alright, this is great. I want to push this to GitHub. Can you please create a GitHub skill that will allow me to push everything with just one command?
This is actually, yeah, I think it will work. The key here is the connection to GitHub. That requires an authentication key, and so if it's not set up already, I would have to set it up. The skill can't do that for you, because you actually have to log into GitHub and generate the credentials there.
But in general, skills are just text files, and so, they're usually hidden away, and so, for example, here. Yeah, again, this is an anti-pattern, right?
Everything is just in one giant folder, and it's a mess. But once it creates the skill, I'll show you it. And it's just a text file, and skills are abilities that an LLM has that enables it to do specific things more in the way that you would like it, as opposed to whatever it wants to do.
So it kind of, like, takes it from, it can do whatever it wants to, like, no, no, no, no, this is, like, specific, specific instructions for how I want you to do it. And I think future speakers will share all sorts of skills related to social science research. I think it'll be really valuable. Okay.
So, it has created a skill, and I can't see this skill in this particular folder. And so, hey, I... you put this skill in a different folder. I actually want it in this live demo folder.
And so, if it doesn't do something you want it to do, you should just ask it to do the thing you want it to do. Usually it works. Sometimes it's gone really haywire, and the best thing to do usually there is I just completely restart a whole new session.
So here, now you can see the skill. It's in this .claude directory, which is hidden, so you won't be able to see it by default in the Finder. And this is the skill it wrote for GitHub: it says, hey, look, if you want to push everything to GitHub, this is how you want to do it.
And then I can invoke this skill, rather than dealing with GitHub commands on my own and typing them in. Some skills you can invoke with a slash command.
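Since skills are just text files, a minimal version of that GitHub skill might look like the sketch below. The frontmatter fields follow Claude Code's skill format, but the name and steps here are illustrative, not the file Claude generated:

```markdown
---
name: github-push
description: Stage, commit, and push the current project to GitHub in one step.
---

When this skill is invoked:
1. Run `git status` and summarize what changed.
2. Stage everything with `git add -A`.
3. Write a one-line commit message describing the changes and commit.
4. Push to the current branch with `git push`.
5. Never force-push; if authentication fails, stop and ask the user.
```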
And then the other thing I want to show you is the Claude.md file. Great, so now you have a sense of what my preferences are. Please put all these preferences and ways I would like to work in a Claude.md file. So, whereas skills give the agent capabilities and specific instructions, the Claude.md file is more of, like, hey, this is how I would like to work.
This is something that is probably going to be custom for every person. For example, here, I had told it, look, I don't want you to put skills in this random folder, I like skills to go in this folder. And when I make a new project in the future, I would like Claude to remember that.
And so, the way to do that is to add it to this Claude.md file, and then when I create a new project, I can seed that project with the Claude.md file, and it will instantly know what my preferences are, so I don't have to tell it again in the future. So here's the Claude.md file.
What it learned about me is I am a PhD student in economics, I like causal inference, I like for it to plan. I like for it to use Python and LaTeX and these things, right?
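As a rough illustration of what such a file ends up containing (hypothetical contents along the lines of what Rose describes, not her actual file):

```markdown
# Claude.md (working preferences)

- I am an economics PhD student; my work centers on causal inference.
- Before writing code, propose a plan and ask me clarifying questions.
- Use Python for analysis and LaTeX (Beamer for slides) for write-ups.
- Put new skills in .claude/skills/ inside the project folder, not elsewhere.
```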
And so over time, it can grow into a very large file that can help you jumpstart future projects. This is where these things build on themselves, and over time, you can really accelerate your efficiency through the agents being able to learn from session to session. Let's go back to the framework. So, we have covered Claude.md files and skills, we covered GitHub a little bit, and IDEs like Cursor.
Building a UI
I want to show you some things that might be helpful for you that you may not think about that LLMs are really good at. So the main one is building a UI. I have no idea how to build a UI. I've never coded a UI by hand manually, but agents are insanely good at this.
And so, usually when I do research in grad school, I would never code up a UI just for fun, because that's a ton of work for something that's probably not that useful. However, agents lower the cost of doing this work, and so, please build a UI for some of the Lalonde results. I want to be able to interact with it. The Internet is gone.
That is one issue Claude cannot fix, because it must be attached to the Internet. "Build a UI for the Lalonde results. I want to be able to interact with it."
And so, for example, you can think of interactive tables, you can build UIs that let you, maybe you have, like, 10,000 different graphs, you could build a UI where you have every graph, and you just click check marks for whichever graphs you want to select. There's a lot of different ways, and so just pick something and make it simple.
An anti-pattern, right? Like, ideally, you should have opinions on these things. I generally don't have too many opinions about UIs, and I want to try to make sure we're keeping within time here.
And so it says here, okay, I'll default to Streamlit, which seems perfectly reasonable. If you wanted to do some other way of having a UI, that's fine too. You could have just an HTML page. There's a lot of different ways to do this.
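For a rough sense of what the agent produces, a minimal Streamlit app of this kind fits in a few lines. This is a sketch under the same assumed NSW-style column names as before, not the generated code:

```python
# Minimal sketch of an interactive results app (illustrative; assumes the same
# NSW-style columns as above and a local copy of the data next to this script).
import pandas as pd
import statsmodels.formula.api as smf
import streamlit as st

df = pd.read_stata("nsw_dw.dta")  # assumed local data file

st.title("Lalonde: interactive treatment-effect estimates")
covariates = st.multiselect(
    "Covariates to adjust for",
    ["age", "education", "black", "hispanic", "re74", "re75"],
)
formula = "re78 ~ treat" + "".join(f" + {c}" for c in covariates)
fit = smf.ols(formula, data=df).fit()
st.metric(
    "Estimated effect of treatment on 1978 earnings",
    f"{fit.params['treat']:,.0f}",
    help=f"Std. error: {fit.bse['treat']:,.0f}",
)
```

You would launch something like this with `streamlit run app.py`, which serves the app on localhost in your browser, which is why nothing appears until you open that address.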
And I think I will flag one thing, since you are a group of PhD students. So, I think that PhD students tend to think a lot, and then do. And with LLMs, it's often easier to do and then think, because doing is so cheap now, and then a lot of times you do the thing, and you realize, this is not what I thought it would be.
And so, for example, with building a UI, you could imagine going online and looking at, oh, what's the best UI that I could possibly come up with?
Or you could have, like, a back-and-forth, 100-thread exchange with Claude about what's the best, you know, computer science stack to use to build a UI. And generally speaking, I think it's better to just try it out, and then iterate, right?
Another inclination you may have is, let me go read about LLMs, and how the model works, and listen to all of Andrej Karpathy's videos before I start using Claude Code, right? And that all makes sense, but I think with these types of tools, if you had 20 hours, it's better to just spend the 20 hours interacting with it, and you'll learn that way, rather than reading about it first. Okay, so here it says it's running on localhost, but I don't see anything on my screen, so what should I do? Ask the LLM. Yeah.
But, like, okay, open this. So here's the app that it generated.
And by the way, all of this stuff may very well not be correct, for those of you who know Lalonde, so I'm not guaranteeing correctness. I did not do any auditing or validation, and I moved through this way too fast. But my point is, here's a UI, and I can select different covariates to use, and it'll change the estimates.
I can reset, you know, there's, like, I can do different estimators.
These are things that, before LLMs, we may never really think to do with research, that now just become so cheap and easy. And so, going back here, you'll see it's, you know, it created all this stuff in here, for the UI as well. And so, yeah, I hope that's helpful.
I think you know my main message from today, which is: ask the LLM. Thank you, and you can really do it for any question, big or small.
Transition to the Next Speaker
Michael Tomz: Thank you so much, Rose. This is working now? Yes? Great.
Replication, software, and project structure
Yiqing treats agentic coding as a major workflow shift for empirical work: reproducing papers, developing statistical software, separating text/code from large binary data, and preserving human skill through deliberate practice.
Introduction
Michael Tomz: Our second presenter is Yiqing Xu, an associate professor of political science at Stanford. His research focuses on statistical methods, in particular causal inference with panel data, and he also does research on comparative politics with a focus on China. He's joining us remotely from JFK Airport in New York, so this is another live demo; we'll make sure that it works. He is going to share the AI-assisted workflows that he's been using to reproduce empirical work and to develop statistical software.
Hi, Yiqing.
Opening and Agenda
Yiqing Xu: Hello everyone, can you see me well? I actually cannot see myself, so my face must be super large in the room. I will talk about two or three AI workflows. First of all, let me thank Mike and Guido for organizing this, and thanks, Rose, for the great demo and for having this great idea.
I saw on Twitter three days ago that, they say, Twitter is group therapy where nobody gets better. So I hope this is different: that we actually learn from each other and also feel slightly better afterward.
Alright, so let me try to move my screen. All right, so I think, as you've seen and tried, this agentic AI with Claude Code and Codex is, to me, another ChatGPT moment, probably bigger than ChatGPT. It also makes us sleep-deprived, because it's both so exciting and creates a lot of stress. But I think by trying more, the stress level will remain high but stabilize.
That's... that's what I think.
The skills, as Rose described them, are to me like structured user and system prompts. You can actually see those markdown files, and they will be attached to specific tasks that you ask the LLM to do.
And I think the key to the power of skills and skill files is that they essentially solve the context window issue. You've probably had the experience of conversing with ChatGPT in the chatbot: as you talk more and have more and more rounds of conversation, it gets slower and slower, because the context window grows.
But skills break it down, the agentic workflow breaks it down, so that for each specific task the system does not need to attach everything that came before and send it to the LLM, which makes it super scalable.
So it's basically, as you've just seen, going from ChatGPT or other large language models being able to answer questions, to doing things for you using various kinds of tools.
The workflows I want to demonstrate a little bit, and especially reflect on, are three things. One is a large-scale replication workflow I have been working on since February. Another is a workflow I use to maintain and create packages for myself, which maybe can also help others.
The last one is my own thoughts, maybe tips, on building your own workflow, because, as you've just seen from Rose's demonstration, we will each have to build our own workflow customized to our own needs.
All right, so I think these three workflows have some commonality: they're structured, and they're verifiable.
I think the LLM, or an agentic workflow, works best when it can actively, in real time, check answers and correct itself. And hopefully the task is also not idiosyncratic; it's iterative and repetitive.
So I've tried to use OpenClaw as well, but I didn't find it very useful and stopped using it, because our job, it turns out, is very idiosyncratic and very complex. But some elements of our job are repetitive and can maybe be automated with more efficiency.
So the Stats Claw and the replication are two examples.
Large-Scale Replication Workflows
All right, so first, on replications. Actually, Mike said I'm an associate professor; I'm not an associate professor until July 1st, but fortunately I passed that hurdle. In the past few years, I've been working on several large-scale replication papers. The reason I replicate those papers, from like 20 papers to 70 papers, is that I want to understand how relevant new statistical methods are to our empirical applications.
Some of these methods were invented by other people; some I'm involved in. I want to understand whether they matter. The other reason is that only when you show something makes a difference do people pay attention, and then you can improve practice.
That's why I was trying to do this, and I think this kind of work was valued in the field, for which I'm very thankful. But with AI, I now start to question the value of this type of work, because it's now so much easier to do.
So I want to make my old self obsolete as soon as possible, so that I can move on to more exciting stuff, and maybe focus more on methodological innovation instead of simple replication. You've seen one replication, of Lalonde, but I can tell you that in order to do good research here, two things are important. One is that you want a very high level of accuracy, maybe 99%, and to address the last 1%. The other is replicating other people's code; as humans, you've probably done that.
It's very painful.
Lalonde is one of the examples that is very well curated: on GitHub, you can directly get the Lalonde data from an R package. But other data you just directly download from ICPSR, or the AEA archive, or Harvard Dataverse. It's very heterogeneous: you have Stata, Python, R, and people have different coding habits. And we work on each of these projects.
Most of the time, I can tell you, is spent harmonizing those replication packages, so that we can apply a template to them and then study methodological innovations or existing problems in practice.
Okay, so with my collaborator and former student Leo, we built this workflow, which is fairly sophisticated right now. This is version 2; we have version 3 now. You can think of it as a three-layer system. What I learned from this process is that, in order to do good scientific research, you actually want to minimize the LLM's involvement.
We want to ask the LLM to teach us skills, which is totally legit, but the LLM drifts because of the probabilistic nature of such models. If you want reproducibility, you want version-controlled, deterministic code, which is a bunch of Python files. So you can think of the LLM in this workflow as playing two roles. One role is to semantically understand the paper.
To reproduce a paper end-to-end, you give it a PDF, it extracts the replication material, and then it starts to run code. First, the LLM needs to understand the paper: understand which number corresponds to what quantity. The other role is as a software developer updating that deterministic code. The next morning, you've closed your conversation, but this code, because it's version-controlled on GitHub, can still produce the same results you saw last night.
What's really interesting in this pattern, this three-layer system, is that the second layer is actually a knowledge-accumulation layer, where a bunch of markdown files are the skills. Skills are just markdown files that accumulate the various weird cases the system has seen in those replication packages. And you can read them. After seeing those markdown files, I don't think it is humanly possible to do reproduction or replication at this scale.
You can do 70 papers, 100 papers, maybe, with a large team. I4R did 100 papers with 200 contributors involved, or maybe more. But if you want to scale it up yourself, you have to rely on AI. And it's perfect for AI, because the results are verifiable.
Just to give you some sense: we are able to reproduce regression tables in almost 400 papers now. We randomly drew 784 papers from the political science top-three journals since 2010, and the pattern is very interesting.
It's also documented in a recent Nature paper, which shows that because of the major data-archival and verification requirements pushed by a bunch of political science journal editors, reproducibility, shown by these green bars, increased dramatically. But these green bars actually involve a lot of work, because for each paper we dig into the replication files, we run them, we harmonize the code, and we match the numbers in the paper with the numbers returned by the software.
It's the heterogeneity that makes this job really hard. If all the code is written by AI in the future, it will actually be much easier. It's the human element that makes this job more difficult.
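In spirit, the deterministic matching layer is simple: given coefficients extracted from the paper and coefficients from re-running the code, check that they agree within tolerance. A toy sketch (the names, keys, numbers, and tolerance are illustrative, not the pipeline's actual code):

```python
# Toy sketch of the deterministic matching layer (illustrative; not the
# actual pipeline code). The LLM never touches this check.
def matches(reported: float, reproduced: float, rel_tol: float = 0.01) -> bool:
    """True if a reproduced number is within 1% of the reported one."""
    return abs(reported - reproduced) <= rel_tol * max(abs(reported), 1e-12)

# Hypothetical coefficients keyed by (table, term): paper vs. rerun values
reported = {("Table 2", "treat"): 1.794, ("Table 2", "age"): -0.031}
reproduced = {("Table 2", "treat"): 1.792, ("Table 2", "age"): -0.030}

checks = {key: matches(reported[key], reproduced[key]) for key in reported}
print(f"Reproduction rate: {sum(checks.values()) / len(checks):.0%}")
```

Because the check is deterministic and version-controlled, the pass rate is verifiable regardless of what the LLM did upstream.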
Another thing that I think may change methodological research is the abundance, in the very near future, of harmonized replication files and harmonized datasets. We know in CS there are a lot of standard datasets, like ImageNet, which greatly accelerated research; because of this kind of workflow, I think we will be able to acquire more and more such datasets.
This is a paper on the left, a pattern from replicating 67 papers over 4 years; I was very fortunate to be able to publish this paper before AI. What it shows on the bottom left is that, interestingly, there is a discrepancy between the two-stage least squares and the OLS estimates, and this ratio is decreasing in first-stage strength, captured by rho, the first-stage correlation coefficient.
This is not theoretically expected, and in experimentally generated IVs you don't see it. Our interpretation, which is also documented elsewhere, is that because IV is a ratio, when you have unconfoundedness or exclusion-restriction violations, such biases are amplified by a weak first stage.
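One way to see the amplification he describes (a standard IV bias decomposition, not the paper's exact notation): suppose the first stage regression of D on the instrument Z has slope pi, and the instrument has a small direct effect gamma on the outcome, violating exclusion. Then, in the simplest case,

```latex
\hat{\beta}_{IV}
  = \frac{\widehat{\mathrm{Cov}}(Z, Y)}{\widehat{\mathrm{Cov}}(Z, D)}
  \;\xrightarrow{\;p\;}\; \beta + \frac{\gamma}{\pi},
```

so the same violation gamma produces a larger bias when pi (equivalently, the correlation rho) is small, consistent with the 2SLS/OLS gap shrinking as rho grows.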
That's what we documented, and we want to show this to researchers to give them caution. That paper took us 4 years to do. With the pipeline, it took us 3 days to extend the sample to 92 papers with 200 specifications, with very similar patterns but much, much less human labor involved. So I think we can do such work at scale much more easily with AI.
Statistical Software and Package Development
The second thing is another way I'm thinking about replacing my old self so I can do more exciting work. Another key bottleneck in my workflow is maintaining and developing packages. My collaborators and I maintain over 10 packages. Some of them are R, some are Stata, which is dying, I think, and some are Python.
The reason I'm not very optimistic about Stata is that it's not very AI-native: it's closed source, so AI cannot learn it very well. You can use an MCP server, but it's a bit complicated. So how do you maintain and develop such packages at scale? That's the question I want to address. Of course, we know Claude and Codex are just very good software developers, so it should be a very simple problem, right?
But most of the training data is on general software development, not particularly on the development of statistical routines. What's nice about statistics is that oftentimes we have proofs: we know the properties of an estimator, so we know what to expect, and we can take advantage of that.
So the simple idea I had is that maybe we can parallelize: we can spawn agents independently, each with a distinct set of information. The builder, the agent that actually builds the package, does not know the true parameters the simulator used to generate the data.
And a tester, which is essentially deterministic code empowered by AI, checks whether the routine built by the builder can return an estimate that is close enough to the DGP that only the simulator knows. This turns out to be very powerful. It's an extremely simple idea, as shown here, and I think it will be absorbed very quickly by those large language models, because they are agentic.
Probably in the next version, you won't even need this particular workflow.
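A stripped-down sketch of the idea (illustrative only, not the actual plugin): the simulator draws data from a DGP with a hidden true parameter, the builder's estimator sees only the data, and the tester deterministically checks the estimates against the truth the builder never saw.

```python
# Toy sketch of the simulator/builder/tester separation (illustrative; not the
# actual plugin). The builder never sees the true parameter.
import numpy as np

rng = np.random.default_rng(0)

def simulate():
    """Simulator: draw data from a DGP whose true effect stays hidden."""
    beta_true = rng.uniform(-2, 2)
    x = rng.normal(size=1_000)
    y = beta_true * x + rng.normal(size=1_000)
    return (x, y), beta_true

def builder_estimator(x, y):
    """Stand-in for the routine the builder agent writes; sees data only."""
    return np.sum(x * y) / np.sum(x * x)  # OLS through the origin

def tester(n_sims: int = 200, tol: float = 0.15) -> bool:
    """Tester: deterministic guardrail comparing estimates to hidden truth."""
    errors = []
    for _ in range(n_sims):
        (x, y), beta_true = simulate()
        errors.append(abs(builder_estimator(x, y) - beta_true))
    return max(errors) < tol

print("builder passes:", tester())
```

The information separation is the point: the builder cannot overfit to the answer, so a passing tester is real evidence the routine works.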
By the way, this is now a Claude plugin: with one line of code in Claude, or by asking Claude how to install it from this website, you can have all these agents set up in your workflow to write packages for you. And it doesn't have to be a package; it can be a set of structured code for your empirical project as well.
Alright, so we tested it. We documented a couple of examples in the paper, but I also tried it on a real application. I wrote a paper with my colleagues Avi and Jens; we already had a prototype from my colleagues, so I just made the package with this particular workflow in a few shots. The first shot was already very good.
Again, as Rose mentioned, you want to discuss the plan quite thoroughly, maybe have a conversation with the AI to plan it out, and then it will automatically create hundreds of tests to guardrail the entire process. It takes about one hour, and then the next round is much faster; the next round is about visualization.
So it's already online; you can check it out. I think this bottleneck of maintaining and developing packages is no longer a bottleneck. It will be very, very cheap to do, because it's simply software development. And what's empowering is that now, without many resources, as an individual researcher or a small team of researchers, you can do this very cheaply. You don't need to hire people or spend too much time on this, and we can focus our time on more interesting problems.
Building Your Own Workflow
All right, so lastly, if I have time (I cannot see the clock), I want to share some thoughts, or maybe tips, on building your own workflow, because I've been involved in building these two workflows, which are fairly complicated. I also want to automate other parts of my work, like validating my citations, or taking notes, or things like that. You will have your own applications, because everybody's habits are different. First of all, as I said, there's no one-size-fits-all.
I think the best way, and the way I find most comfortable, is to do my own work as usual, but try to replace tasks one by one. You can learn from the internet; you can try to build skills, or directly import skills from people on Twitter or on GitHub.
Usually, the rule of thumb is that if a task is useful for computer scientists or for business people, it's already there: the skill is already there. For example, digesting a PDF into a Markdown file or a TeX file; there must be a skill for that that is very well written, already there, and proven to work. However, our jobs as researchers, Stanford researchers especially, are quite idiosyncratic and complex, in a good way.
So my feeling is that oftentimes the ingestion cost of trying to make other people's skill files work for you is actually much bigger than developing the skill yourself. What I do is basically start from a painful task, and then, after a few dozen rounds of conversation, ask the AI to reflect on what we have achieved in the session, and then ask it whether it's worth making a skill.
If it's too idiosyncratic, maybe it's not worth making a skill, and it will give you its thoughts. (I don't know which pronoun we should use to describe AI.) And then we can iterate. Gradually, after you're done with a few tasks over a month, you will have a pretty structured and usable workflow of your own.
All right, so I think most people should start with the $20 version of Claude Code or Codex; either one is fine, and Codex is also very capable. But you will very quickly hit the rate limit, and I think the $100 plan is worth every penny right now. I know $100 is expensive for students, and I think departments, institutions, and universities need to think about how to provide institutional support for students, but it is worth every penny. Right now, it's heavily subsidized.
People have made calculations, and the estimates vary; some people say the $200 version is actually equivalent to $2,000 to $5,000 worth of tokens, depending on the hours you work. But you will be able to create your workflow, and you don't have to rush.
Today you have a task; try to have AI be part of it, and then make a skill. Tomorrow you have another task, and gradually you will be able to build it up.
Sharing, Syncing, and Personal Workspaces
Another thing I find very interesting, and I'm not sure how many of you have already encountered this problem, is deciding what should be shared and what is your own. Oftentimes we work with collaborators: what should be shared, what should go on GitHub, and what should be put into Dropbox? I know there is a cloud version where you can put everything in Claude, but that version doesn't work for me, because I cannot see the file structure very transparently.
Also, because it uses Ubuntu, it lags a little behind on certain packages, in R or Python. So I don't like to use their cloud system for this. What I do is keep everything local, but use GitHub and Dropbox to sync and share the files.
So I think this works for me, maybe there will be other better plans in the future.
So I have a personal AI brain. That is my main workflow, but Stats Claw became public, open source, so I basically made it a plugin, and now I can plug that workflow directly into my AI brain. Most people will have their own workflow in a private repo, unless you want to share everything with the world, in which case you can make it public on GitHub. These are mostly what computer scientists call plain text.
Those are the markdown files, some of the Python code, the Claude.md files, and they're easy to maintain using GitHub. But you will also have a lot of binaries: slides, data. Those are not suitable for GitHub, because those files are large and not plain text. And you may also want to share them with your co-authors.
So I would put those into a Dropbox folder, and then use a symlink to establish a link to your workspace, because Claude mostly works in your personal workspace, which is your AI brain. A symlink (for example, `ln -s ~/Dropbox/project ~/workspace/project`; you can ask Claude what a symlink is) is essentially a mapping from the Dropbox-synced folder into your own workspace, which is where your AI brain lives. And the same goes for Overleaf: Overleaf is essentially a Git system in disguise.
Every Overleaf folder can be Git-maintained; it's version-controlled. You can pull and push, coordinated with your co-authors. That's what I do with my co-authors. Also, if you have code projects, that code is mostly plain text.
So you can also use a GitHub public or private repo and then symlink it into your workspace. That works for me. Maybe in two weeks things will change, but this addressed the problem of sharing data with my collaborators. It also addressed the problem of changing computers: I go home, I don't carry my laptop back home, and on my home computer I just pull everything from GitHub, or Dropbox has automatically synced it, and everything is ready. But the conversation will change, right?
The AI will write a handoff as a markdown file to give to the next AI, which is the same model but a different conversation, so there's no short-term memory. The memory is transferred through these markdown files.
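A handoff file can be as simple as a few markdown sections. This is a hypothetical illustration of the pattern, not one of his actual files:

```markdown
# Handoff for the next session

## What was done
- Extended the IV sample; results written to results/summary.csv.

## Open items
- Three papers fail at the data-download step; see logs/failures.md.

## Conventions to keep
- All regressions run through scripts/run_specs.py; never edit outputs by hand.
```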
AI Burnout and Adapting as a Community
Okay, last thing. The AI burnout is real; I feel it firsthand. I personally have three sources of anxiety. First, you will feel behind, and it's stressful, because the field moves so fast, on a weekly, maybe daily basis: Claude is pushing out so many products, and so are OpenAI and Google. You also question the meaning of your work, past, current, and future.
I question the value of my work; that's why I want to make my old self obsolete, so I can move on to new things. And you will also feel overly stimulated, because it is so powerful and gives you so much positive feedback. I've even started to develop a withdrawal effect if I don't use it. So I think we need to take care of ourselves. It's very real; at least for me, it's very real.
Yiqing Xu: My current answer: I talk a lot with my CS colleagues, some of whom are even more pessimistic than I am. One part of the answer is that there's a long tail of scientific inquiry, especially in social science, maybe also in the sciences, that AI and the AI companies are not focusing on. And AI right now expands what's doable for us. So for the next few years, I don't know about 5 years out, but maybe for the next 2 or 3 years, there are a lot of new problems we can now solve
that we could not solve before, could not solve even two months ago. Another thought, and this may be a little pessimistic, is that even if AI can do our job better than we can, the pursuit of knowledge itself is very rewarding. It's like Go players: they know they cannot beat AlphaGo,
there's no way, but they still play for fun, and they actively learn from AlphaGo. That's a little pessimistic, but maybe the pursuit of knowledge in itself is very exciting. All right. Anyway, I think Guido is right, and I completely agree with him: the tsunami is already here. It's not three days away; it's already here. We can do fantastic things, and if you don't, other people will, and we have to adapt.
And we as a community need to think about how to evaluate good work, and how to do good work in the first place.
And another thing is that, as a teacher, I'm thinking about how to prevent skill atrophy, also for myself. It's now very hard to work without AI. For production, for producing research work, I think it would be ridiculous, or even malpractice, not to use AI, just like the person in this picture. But I also think we need to keep training ourselves, even with resistance, without AI. Otherwise, we will lose our functionality as humans.
All right, that's all I have. Thank you.
Transition
Michael Tomz: Thanks very much for that presentation. I'm glad to have you in the political science department as my personal AI trainer; I'll come to you with questions. It's my pleasure to introduce our third presenter, Matthew Gentzkow. Matt is the Landau Professor of Technology and the Economy in the Department of Economics at Stanford. He's also the faculty director of Stanford Impact Labs.
His research focuses on the economics of media and technology industries, and how those industries affect democracy and civil society. Matt will be sharing some practical advice, especially for the PhD students in the room, about how to be a PhD student and what to learn in the age of AI. Thanks.
Research judgment in an AI-saturated field
Matt argues that social science remains valuable because the hard parts include taste, credibility, institutional reputation, and knowing which questions deserve serious evidence.
Framing the Moment
Matt Gentzkow: Thank you, everybody, for coming out. Thanks to Mike and everybody who was part of organizing this. I'm just incredibly glad we're doing it, because it seems really fun and really important.
I am, in some sense, the last person who should be up here talking to you in terms of, you know, how close am I to the technological frontier of adopting AI in my workflow? Not that close. Like some people in this room, I may be desperately racing to catch up, and I don't have any particular knowledge or expertise that is necessarily any better than yours. Also, prognosticating about the future is something that I studiously avoid and don't consider myself to have any particular skill in.
But I thought at this time, just given how interesting all of this is and where we're all sitting, I would try to share some thoughts, putting myself back in the mindset of being a PhD student, thinking about what it is to be a PhD student right in this moment, and some thoughts that have occurred to me as I think about that. So, free disposal, but these are just my musings for PhD students thinking about AI right now.
And I would start a little bit where Yiqing left off. I think there's a lot of uncertainty right now. There's a lot of change, there's a lot of anxiety. Are we all just going to be engaged in something that is kind of playing Go for fun, even though the robots are better than we are? What does that mean for your futures, your careers, your economic prospects? I think all of that anxiety is real, and we need to be aware of that. But I would also go out on a limb and say I think this is
probably the most exciting moment ever to be a Stanford PhD student in the social sciences. And I use the word exciting here, and not necessarily
easy, or comfortable; there is a lot of uncertainty, there's a lot of anxiety. But I think the potential, for both what you can achieve in your career and the impact you can have on the world, is larger right now than it's ever been before. Certainly the right tail of what the most creative and talented students in this room can achieve, are going to achieve, all of you. That's why I put Stanford here instead of just PhD student.
That's just a stand-in for especially talented and creative, which is all of you.
Okay, so why do I say this? Maybe it's kind of a crazy thing to say these days. Well, let me just make a couple of observations. One: we've been doing a lot of tactical stuff here, which is super important. You were trying to replicate Bob LaLonde's paper; how do you do that efficiently?
Those details, what you do every day, how you make your workflow better, how you get AI agents to read your email for you, are all super important. But I think it's also a really important moment to zoom out and think about what we're doing, and whether the broad purpose of this enterprise is addressing important challenges and needs in society and making people's lives better.
I do not think there's any credible argument that we no longer need to worry about that, that it's no longer an important enterprise. There's a very long list of questions, the questions that people in this room and our colleagues all study, that are more important and more critical than they've ever been before, and where the need among policymakers, decision makers, and private firms for credible, reliable, high-quality science
to guide those decisions, to help them figure out what to do, to help us figure out what to do, to help us solve these problems, is bigger than it's ever been before. I don't see any lack of need for the enterprise of social science in the world right now, and I think that's going to remain true even as AI accelerates.
Second: if we have all of that need, the tools, the firepower, that you are all learning, building, and developing to address those needs is growing exponentially.
When I went to grad school and thought about how the social science I was learning might address problems in the world, the tools I had as a graduate student were exponentially less powerful than the tools you have already, and that you will have as you go through grad school and the next phases of your career. So one general theme I have in thinking about this is:
as the power of our tools and our ability to do stuff grows, the importance of, and return to, thinking about what we are doing with them and why goes up.
If you're pointing something really, really powerful, the coefficient on how social utility changes, or how your career prospects change, if you point it in the right place versus the wrong place gets really big. Now, about that pointing of the firepower, we might say, well, there's still nothing for us to do, because the AI will figure out where to point it, how to use all these tools, how to make the world a better place. That may be true at some point.
But in some relevant medium term here, one that I believe is going to last for a while, deploying that power effectively is going to need smart, innovative, creative humans to do it. And that's going to be you guys. Plus, on top of everything else, this work is just getting a lot more fun.
In the rest of my career, I have not heard a lot of people talk about withdrawal from the fun experience of writing our code. That was not part of the thing so much before. And I think it's very real. You watch Rose up here chatting with Claude with voice commands: oh yeah, why don't we just fix all that stuff? Could you... I'm busy, could you just figure it out for me?
It's a completely different experience of what the minute-to-minute, day-to-day of research is like. It's pretty fun. I find it really fun.
You know, that's not so bad. Okay, so with that kind of introduction, let me talk about a few things. One, maybe give you a little more justification for why I think deploying all this firepower to address social needs and social problems is going to require humans, at least for a good long while. Second, if I were a PhD student today thinking about what kind of human capital to invest in, what might I be thinking about?
What kinds of human capital are likely to remain really valuable in the future, or to get more valuable, versus things that may be less so? And then I'll end by summarizing that with a little bit of advice.
Why Human Judgment Still Matters
Okay. So why do we need humans, now and for some time? Just a few different thoughts that feed into that for me. We're in this world where everything seems to be changing super fast, and just extrapolating things out, it feels like: what are any of us going to be doing? If we extrapolate this trend out for 6 months, or a year, or 3 years, or 5 years, what are any of us going to be doing?
Obviously, computing power and the capability of these models are changing incredibly fast, and the whole economy is changing very fast in terms of pointing resources at this, you know, giving people subscriptions to Claude Code that have a marginal cost of $2,000 to $5,000 a month and selling them for $200.
We're devoting tremendous social resources to this, and the frontier ways of working are changing really, really fast. So aren't we just going to quickly converge to utopia, solve all our social problems, with robots in charge of everything? They might eat us and destroy the world too, but in the optimistic scenario, they'll be in charge of everything and we'll have this utopia. Why are we not just going to follow this trend line out and extrapolate?
Well, because there is a bunch of stuff in the world, even in the best-case scenario where we have perfectly well-meaning, super-capable AI, that changes really slowly. And that tends to be stuff that involves humans in inherent, fundamental ways. Human behavior, broadly, changes slowly.
Some of it's changing pretty quickly in relative terms these days, but it does not change at the speed of OpenAI model releases. Laws and policies: there's a bunch of stuff right now in the world that says such-and-such a thing must be done by a human. AIs cannot testify in court cases, for example. And all of that policy structure changes slowly. Institutions and institutional structures change slowly. Firms and firm organizations change slowly.
Incidentally, disciplinary norms change slowly too. So one of the interesting questions kicking around is: what are the incentives for careers, how are we going to evaluate tenure, how are we going to evaluate graduate students, and how are universities, the way they're organized, the departments, all of that, going to change?
Again, all those things are going to change, but they're going to change slowly. They're not going to change at the pace of AI companies. So all this means there's gonna be a lot of stuff that evolves slowly, and interestingly, many of the most important social problems we were talking about depend on these slow things. For the foreseeable future, humans are mainly the ones in control of, and impacting, those slow things.
If you go right now and tell Claude, please improve food safety laws in Japan, that's a pretty hard problem. It might think of some good ideas about how to improve food safety laws in Japan, but actually improving them is not something Claude is going to be very good at. Why?
Because that change requires societal change, requires policy change, requires convincing people, requires communicating. And in many of these cases it's not even just that one person needs to change; you have to have coordinated change among many people. That's one thought: frictions that slow things down. Second, I think particularly of social science that's pointed in some fairly direct way at trying to improve something in the world.
In order to do that work, still today and for the foreseeable future, critical parts of it depend on interacting with human beings. Thus far, robots on their own do not have a lot of success in, say, figuring out how to get some company to give you really interesting proprietary data that they otherwise wouldn't give you.
And that is a key input to many, many of the most influential research papers kicking around. Claude would not do a good job if you said: beginning to end, please write a paper where a key part is that you went into a middle school classroom and ran an RCT where kids used different kinds of math curricula. Claude, right now, would have a hard time with that, or with
finding subjects for brain imaging experiments, doing field work in development, interacting with governments, building partnerships outside the university to do work in various ways, or translating research: when you've found something that works, how do you communicate it, how do you get it widely adopted? All of this is stuff that I think falls in the slow category, and that humans play a critical role in.
Third, as I was alluding to, in a lot of ways the big existential question for all of us is where we are going to point all of this firepower. And that pointing is something that, if we want to do it well, has a fundamental human element to it. That's not to say we shouldn't use AI there:
I think one of the interesting things about AI, and one of the important things for us to remember, is that we want to make sure to use it not only in these downstream tactical things, but also in the part of our workflow that is ideation: big picture, what questions are we asking, and why, and what research might be important. AI can be really helpful with that too. But at a high level, the 'why' of all of your research is something that involves
understanding and weighing human values, institutions, perspectives, and norms, and a lot of stuff that is not, right now, the comparative advantage of AI models.
In addition, thinking a little more concretely about these objectives: there's an interesting dichotomy that we've seen for a long time, even before AI, with old-fashioned machine learning, which is that computers are especially good at things where the objective you're trying to maximize is quantifiable and measurable in real time. Because then you just turn the thing loose, and it can hit that optimization problem and do really, really well.
Things in that category are like playing chess: if I play chess, I know whether I won. Or folding proteins: if I got it right, I know I got it right. So why is writing a really convincing, quasi-experimental, natural-experiment study of the returns to education not in that category?
It's because, at least given the typical methods we use, what constitutes a convincing paper, whether Guido's IV, or my IV, or somebody else's diff-in-diff, or regression discontinuity, or regression, or whatever, depends on the priors of the audience, and right now has a deeply subjective element in its evaluation. Audiences have to evaluate: do I find your supporting evidence, your story for why this is convincing, compelling?
There's lots of more automated stuff we can do to validate and support these things, but so far, for these kinds of things, people's priors are intimately bound up in them. As Jesse Shapiro and Isaiah Andrews have talked about and written about, one of the important ways we can think about the role of science is speaking not just, as in a decision problem, to a single decision maker, but communicating in a way that humans with heterogeneous priors and different beliefs about the world can see the evidence, make sense of it, and learn from it.
And so I think all of that has an inherent human component to it, too.
Okay. I don't want to be the guy who says there are certain things in the world that are essentially human and AI will never be able to do them. I'm kind of of the belief that, if we go out far enough, it's all fair game. But in the medium term, I think there's gonna be demand for the kind of capabilities and skills that you all are building.
Human Capital in the AI Era
Now, within that, which kinds of capabilities and skills will be valuable? This goes to the main theme of this whole workshop. The obvious thing, the one we're all talking about, is what you should really be investing in first and foremost: using these tools efficiently, creatively, effectively. I think there should be zero people in this room who are not devoting substantial time and effort to doing that.
I don't think there's any great analogy for this change, because we've never been in a change like this. But a reasonable lower bound, or rough approximation, for the magnitude of the change we're talking about is to think about how social science changed once we had pretty fast computers, around 1970, 1980, 1990, versus before.
That kind of transformation, in my best guess, is broadly analogous in order of magnitude to the one we're gonna face; maybe it's smaller than the one we're gonna face. So if you were coming into social science in 1990, would you want to spend a fair bit of time learning how to use computers? The answer is clearly yes. And we can think a lot about how to use these tools
to write the same kinds of papers we've been writing, better, and to write a lot more of those papers, faster. But for sure, just like in the era of computers, the most innovative and important and impactful work, and what's gonna drive the most successful careers, is people figuring out what kinds of papers we can write now that we couldn't write before, what kinds of questions we can ask now that we couldn't ask before.
This is all about creativity and innovation. And a good thing for you all: who tends to be good at creativity and innovation? Young people. So you are very well positioned, in ways that I'm not, that Guido's not, that others are not, to figure this out.
Well, maybe Guido is, actually, but I'm not. So, the second thing, related to what we've been saying: the return has always been high in our disciplines to being able to ask good questions. I often tell my students that
figuring out how to evaluate what is a good question, what is a good research idea, what is a good research proposal, how to solve the dynamic programming problem of, okay, I have these five things I'm working on, which of them is most valuable, where should I allocate my next effort, my next hour: that is the hardest thing you learn in grad school, it's the most important thing you learn in grad school, and it's probably the thing we teach least well.
All the other stuff, like taking some econometrics classes and learning how to run this or that regression, everybody can do just fine. This is the hard thing, it has always had a high return, and that return just gets even bigger now. As the power of these tools grows, the return to pointing them in the right direction gets higher. You know, Guido alluded to what is happening at journals.
I'm right now the editor of one of the American Economic Association journals, and we have regular conference calls about how we're going to change the editorial process for the AER, AER: Insights, and all these journals, in a world where the number of submissions may go through the roof, and where the tax on editors trying to handle them may go through the roof.
What are we gonna do in a world where the production cost of plausible-looking research goes way down? What's kind of obvious is that as the quantity gets really high, the thing that becomes really scarce is the ability to evaluate and discriminate, to figure out what work is most important, most credible, most well executed. And that, for the journals, is really hard; it makes us think hard about what the criterion is.
But for you all, it means that categories of work where the answer to the question 'why was this important?' is as tangible and objective and clear as possible, where this thing, in fact, told us something we really needed to know for the world, are going to do well. Say
we were trying to figure out whether we should have voter ID laws or not, or we're trying to figure out how to teach kids math, and we just learned how to teach kids math better: that kind of stuff is gonna have a big return. Research in the category of 'this is mainly interesting because I solved a really hard problem,' or
'I came up with a really clever natural experiment that estimated a parameter, and we're not quite sure what the parameter's good for, but someday we might be interested to know it': the price on that kind of research, I think, is gonna go down.
Okay, third. Since a lot of the stuff that's gonna slow us down is about humans, people who are really good at working well with other humans are gonna do well. Relationships, intuition, emotional intelligence, the ability to work in teams: the return to all of these goes up, and they may remain essential for all kinds of different steps, including doing the research and communicating the research. And as for emotional intelligence and the ability to form relationships with other people:
the robots are making gains, but I think we're gonna beat them for at least a while.
Fourth, what I put here as being a good CEO. I often tell students and others that, of all the things academia resembles, it is most like entrepreneurship. Being a grad student is sort of like being a founder in a garage: you do all this stuff by yourself. And as you go up,
the scope and scale of what you can do grows; my job is a little more like running a small startup with a team of 10 people and trying to figure out how those things work. Now, something that is just so striking and cool in watching these workflows, in using agentic AI and thinking about how it works, is that what we are doing looks more and more like management.
Yiqing has, in his workflow, a janitor, a builder, and a checker, and you have to remind the janitor to come in after the checker to clean things up. Questions about how you organize projects, manage projects, and deploy those resources effectively all become more and more important, as does higher-level design: strategic thinking about where it's all going, long-range vision.
So I think these kinds of management skills become more important.
Finally, trust is gonna be really scarce in a world where an enormous volume of stuff is produced and people and institutions have to make critical decisions. Figuring out who they can trust and rely on is gonna be really hard. A school district needs to do something, the president needs to do something, a journalist is trying to write about something, and there's a whole flood of AI-assisted papers out there with all kinds of different answers. Who are they going to turn to?
How are they going to evaluate it all? Universities, institutions, and individual researchers who have reputations for credible, reliable, high-quality research are gonna do really well in that world.
Advice for PhD Students
So, to end, let me summarize all this in a bit of advice. One, as we said, you should be aggressively learning and experimenting. I love the idea of just starting today, and starting each day by doing some more; before you know it, you're off and going. Two, think really hard about why we are doing this.
If your answer to that question is pretty loose and fuzzy, sort of like, well, I kind of liked solving problems when I was in undergrad, and it seemed like it would be fun to keep doing that, that's fine. But this might be a good time to go beyond that and recognize that you now find yourself positioned to perhaps solve some of the most important and biggest social problems in the world.
Even if that's how you got here to begin with, more focus on the why of your work will help you.
Three, invest in the human stuff. Take yourself to business school, in the sense that concrete stuff about management and how to run organizations is going to be really tangibly valuable in thinking about AI workflows and how to run your research. On the other side, and I didn't say it out loud earlier, you're gonna get a promotion: grad students used to be the person tinkering in the garage, and you're now running a firm with, like, 100 people in it.
So how do you run a firm with 100 people in it? Well, people have actually thought about that some, and they teach it over here. Learn some of that.
And take a deep breath. It's crazy, it's scary, but it's also super fun. That's it for advice.
Stanford Impact Labs Fast Grant Pilot
The last thing I'll say: as Mike mentioned, in addition to being a professor in the economics department, I'm also the faculty director of Stanford Impact Labs, and so along with IRiSS and others on this campus, we're all thinking about how to support and accelerate this transition and make it happen efficiently. So I wanted to let you know about a pilot project that we are designing. This is not an official release yet.
But we are hoping to launch, on July 1st, what we're calling a fast grant program at Stanford Impact Labs for AI-enabled research. It's fast in the sense that it will run on a rolling monthly basis: you apply on the first of the month, and we make decisions by the end of the month. The purpose is to support innovation in new uses of AI tools,
experimenting with new ways to use AI and pushing the frontier of new ways to use AI that accelerate the speed, quality, or relevance of social science pointed in some tangible way at important problems and needs in society. These grants are open to faculty, postdocs, and PhD students. There will be a range of award sizes, up to $25,000 for grad students and up to $50,000 for faculty, with a 12-month award period, and we expect to run it monthly.
All of this is provisional; we're still finalizing the details. There are two QR codes here. One lets you look at our draft call for proposals and give us feedback if you think there are other ways we should design this program to make it more effective. The other lets you sign up to be notified when we launch. Thanks, everyone.
Transition
Michael Tomz: Oh, okay, great. Matt, thanks very much for that insightful presentation, and for the exciting announcement at the end about the grant program. Alright, I'll now introduce our next speaker. Susan Athey is the Economics of Technology Professor at Stanford Graduate School of Business. Her research focuses on the economics of digitization, marketplace design, and the intersection of causal inference and machine learning.
And Susan is going to share some insights about how to use foundation AI models for econometric inference.
Foundation models as econometric tools
Susan moves from coding assistants to foundation models for empirical inference, including how researchers can think about prediction, fine-tuning, embeddings, and domain-specific data.
AI as an Empirical Research Tool
Susan Athey: Am I on? Do I need to do something here? Okay, very good. So, this is such an exciting day, and every talk so far makes me want to say more about that topic, but I'm going to change the topic again. Do listen to what everybody said before, because it's amazing. One thing that Rose didn't say, which I would add to her presentation, is that you can use all these tools for writing papers,
and for checking papers. Papers are just files, and it actually works much better if you use coding tools for your technical papers, rather than just chit-chatting with AI, because you can do version control, and you can upload a dictionary and make sure it gets your terminology right, and go through and check 10 times that the numbering is right. An academic paper is much more like code than it is chit-chat. So that's my little tip there.
I want to talk about something a little different, which hasn't been as much the topic today: actually using AI in your empirical work, more as an econometric tool. I actually have two distinct slide decks here, each of which would be 25 minutes; one shows you what to do, and the other how to do it. I attempted to merge them into one 25-minute talk, which is gonna be a little fast.
But I have a long leave-behind deck with more detailed to-dos, because I'm too chicken to do live demos. Rose is brave.
Three Ways AI Enters Empirical Work
I want to start at a very high level with how you might bring AI into your empirical work. The first way, and I think the most common thing you see people doing today, is using off-the-shelf AI as a tool, and I'll show you what that means.
The second, which I think is less common, and where I'm going to need to show you a little more so you know what I mean by it, is to modify or customize AI tools, to tailor them to be good at the specific tasks we want them to be good at.
And the third area, which I'm not going to talk about today but which is part of my research agenda, is using what we understand about statistics to make AI itself better. A lot of the people building AI are engineers, and they think of these as black-box systems, where, in my view, reasoning about them as statistical models actually makes them better at what they're supposed to be doing. But not for today.
Still at a high level, before I zoom in: here is a somewhat complete view of using AI as an empirical research tool. What do I mean? The simplest place to start is to imagine you're doing empirical analysis where you have an outcome, some kind of treatment, policy, or intervention, and some covariates. AI can be used to create your outcome Y, your treatment, your control variables, or all three.
There are many examples of all of those in the literature already. If you wanted to look at product characteristics, you could use AI to read product descriptions and turn them into product characteristics used as control variables or hedonics in some kind of demand equation. If you wanted to think about text as treatment, it could be something like the slant of a newspaper and its impact on readers.
If you wanted to think of text as an outcome, it could be something like the quality of a review as the outcome of some change a store makes. So text could be any of those things.
People are also getting very creative about creating data with these models, where you give different structured inputs to a language model and its output becomes data for some type of experiment. You might vary the input, say, this is a query from a man and this is a query from a woman, see what the answers are, and then compare or analyze them.
There are more complex examples of that, with agents playing games or doing other things. Another generative example: that one was about creating data, but you can also use these generative technologies to create interventions.
In a paper we're revising for Management Science with Dean Karlan and a couple of co-authors, we studied the peer-to-peer lending platform Kiva, and we used generative AI to change the profile photos of people trying to get loans in just one dimension.
Then we ran experiments to understand the causal effects of those changes, and looked at what would happen on the platform if they introduced these types of changes, prioritizing images using their characteristics. One thing we found, for example, is that women smile a lot, and people really like smiles, so if you just threw this into the recommendation system, women's profiles would go way up, men's profiles would go way down, and men wouldn't get loans on Kiva.
So those are the types of things you can do with these techniques.
Foundation Models and Embeddings
But I really want to hone in on foundation models and how we use them in econometrics. I think of foundation models in terms of three big ideas, and I'm gonna talk through each of them. The first big idea is that a foundation model is something that learns the underlying structure of a problem, like the English language,
by looking at lots of examples and just trying to predict what comes next. Not everything is sequential, but to keep things simple in this talk I'm going to focus on sequences, so: just predict what comes next.
The idea is, if you can take the beginning of any sentence and fill in the next word accurately along the way, you're building up some understanding of language. The good and the bad news is that you want lots of data for this: the approach of just predicting what comes next doesn't require you to clean up your data, and you might use lots of data that's not exactly the same as your downstream task. I'll come back to that.
The second part of foundation models is that you're somehow going to reduce dimensionality, and with sequences of discrete items, the dimensionality is very high. To look forward a bit: I started working on this before ChatGPT came out, and in a lot of my applications I'm looking at sequences of jobs rather than sequences of words. My vocabulary is 330 jobs, but even with a vocabulary of 330 jobs, there are many, many possible sequences of jobs.
And that's the space we're talking about: it's huge.
So we want to collapse it down into something called an embedding or a representation (those are synonyms). It's just a real-valued vector: some text turns into a real-valued vector, and that's how it's stored. Now, people usually stop there and say, well, that's kind of uninterpretable. But actually, it is interpretable, because that vector was selected for a particular purpose. What was that purpose? The next-item prediction problem.
Here I'm showing an illustration of a next-job prediction problem. What are the embeddings trying to do?
They're trying to say: if I take in your first job, I want to do a good job predicting your second job; if I take in your first five jobs, I want to do a good job predicting the sixth. Again, though, there are so many possible combinations. If I just put in indicator variables for all the different sequences, I would have more covariates than observations, because there are more sequences of jobs than there are people on Earth. So you cannot just estimate it flexibly; you need to somehow simplify it.
Historically, we would collapse it manually. We would say: have you ever been a manager? What was your last job? A different way to collapse it is to use an embedding function, and this expression at the bottom is what you're trying to do: lambda is a function, theta are its parameters, and it's a function of your history. You put in a list of jobs, and the function spits out a vector.
Then the statistical problem the transformer is trying to solve is: what is the probability that your next job is one particular job? If there are 330 jobs, the model produces 330 probabilities. The loss function is about getting those probabilities right: you get penalized if the model predicts the wrong next job, and rewarded if it predicts the next job that actually occurred.
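To make that concrete, here is a minimal sketch in PyTorch of the estimation problem as described: an embedding function (a small transformer here) maps a job history to a vector, and a multinomial logistic head turns that vector into 330 next-job probabilities trained with a cross-entropy loss. The architecture sizes and names are illustrative assumptions, not her actual code.

```python
# Minimal sketch (illustrative, not Athey's code) of next-job prediction:
# lambda_theta maps a job history to an embedding vector; a multinomial
# logistic head maps the embedding to probabilities over 330 jobs.
import torch
import torch.nn as nn

VOCAB_SIZE = 330  # number of job codes, as in the talk
EMBED_DIM = 64    # embedding width; an arbitrary illustrative choice

class NextJobModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.job_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4,
                                           batch_first=True)
        # lambda_theta: the sequence encoder (causal masking omitted for brevity)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # multinomial logistic head: embedding -> 330 logits
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, history):                  # history: (batch, seq) job ids
        h = self.encoder(self.job_embed(history))
        embedding = h[:, -1, :]                  # lambda_theta(history)
        return self.head(embedding), embedding

model = NextJobModel()
loss_fn = nn.CrossEntropyLoss()  # low probability on the observed job is penalized
histories = torch.randint(0, VOCAB_SIZE, (8, 5))  # 8 workers, 5 jobs each
next_jobs = torch.randint(0, VOCAB_SIZE, (8,))    # the observed 6th jobs
logits, emb = model(histories)
loss = loss_fn(logits, next_jobs)
```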
Why am I bothering to show you this? Because this is actually the interpretation of the representation. If the last stage is a continuous function like a logistic (usually it's just a multinomial logistic regression at the end, with this vector of factors as covariates), then two embedding vectors that are close together imply similar probability vectors.
That is, two embedding vectors being close means that the conditional probabilities, say those 330 probabilities, will also be similar.
So the model understands that two job histories are similar, but 'similar' has a very precise definition: similar means my probability distribution over what happens next is similar too. That's the distance metric. So then, thinking about what I can do with these embeddings (also called representations or latent factors; there are many names for them): if embeddings are close in Euclidean distance if and only if they generate similar conditional probabilities,
then they'll be useful in other tasks, to the extent that that similarity measure carries over.
In my example, I started from a foundation model trying to predict the next job, but actually I want to predict wages. If two histories with similar probability distributions over the next job also have similar wages, then this embedding is going to be useful for me. It may not be perfect, it may not be the very best embedding for predicting wages, but it's probably better than the dummy variables I used before.
So the simplest idea is: take a foundation model, take the embedding, and just stick it into a regression.
You can do more complicated things, as I'll come to in a moment, but that's the simplest thing: stick them in as covariates for prediction. Now, you might say, oh my gosh, wait a second, Llama has a couple thousand of these dimensions, and I don't want a couple thousand covariates in my regression. There are also dimensionality-reduction techniques, whose abbreviations you can ask an LLM to explain, to apply before you stick them in the regression.
So that's also something you can do. And then you can use the embeddings in all kinds of ways, as in the sketch below.
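As a hedged sketch of that simplest idea (names and sizes here are placeholders): extract one embedding per worker, reduce its dimension (PCA is one common choice; she only says such techniques exist), and use the result as regression covariates.

```python
# Minimal sketch: embeddings as covariates in a wage regression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))  # stand-in for model embeddings
wages = rng.normal(size=1000)              # stand-in for observed wages

# Large models emit thousands of dimensions; reduce before regressing.
z = PCA(n_components=20).fit_transform(embeddings)

# "Just stick it into a regression": reduced embeddings as covariates.
ols = LinearRegression().fit(z, wages)
print(ols.score(z, wages))  # in-sample R^2, for a quick sanity check
```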
You can use them to predict outcomes; you can use them in classification, for sentiment, slant, and so on. In my PNAS paper from last year, we build what we call interpretation trees, where we cluster histories that have similar embeddings, and the clusters are chosen as those that matter for predicting wages, or for the gender wage gap, which is the subject of that paper. Another thing you can do is search and retrieval.
Sometimes in social science, we're doing matching.
I have a paper using old-fashioned, pre-2022 machine learning methods, generalized random forests, where I'm looking at layoffs. I've got Swedish administrative data, so I take some workers and match them with similar workers from similar companies, with similar wage trajectories, in similar towns, and so on, and then try to get the causal effect of the layoff by comparing those people. The old-fashioned way to do that was very heuristic.
Now, matching on embeddings could be a better way to make sure you're capturing the richness of history when you match people.
Fine-Tuning for the Dataset and Loss Function
Now I want to go to the third thing, which fewer people than I expected are doing at this point. This is something I was doing before ChatGPT came out, and I thought it was going to be really popular; I think everybody got distracted by ChatGPT and all the other things you could do. But I still think this is a very exciting area, and that's fine-tuning.
When industry talks about fine-tuning, they're thinking about making a language model that speaks in a certain way, in a certain dialect, that writes articles that look like newspaper articles. That's how you hear about it in industry. But from a statistics perspective, fine-tuning is just about estimating a model on a dataset that has the properties you care about.
So, if I'm trying to work with the CPS, or the PSID, or the NLSY, or the General Social Survey (call those the datasets I care about), those are representative samples with properties I selected. Maybe I subsampled, but I've created a dataset I'm interested in. That dataset, though, is too small to train a big foundation model: not enough people, not enough histories.
Fine-tuning is a system that lets you take an off-the-shelf model trained on much more data and continue the estimation on your specialized dataset, so that the loss function becomes matching the conditional probabilities in that dataset. You can think of it as de-biasing, or as transfer learning, as it's called in machine learning.
Basically, if you train on your dataset long enough, the model's predictions will match the conditional probabilities in your dataset, rather than in whatever dataset you started with.
Another point: you're taking a model that somebody else spent $100 million training and continuing the training on your own dataset. You don't pay the $100 million; you just do a little more, which is much cheaper. And you can change not only the dataset but also the outcome you're optimizing for. You can change the loss function.
That's one of the things I did in my papers.
In my fine-tuning, instead of just trying to keep predicting the next job, I also tried to predict wages, which is a mean-squared-error loss function. The same representation, the same underlying function that maps from histories to embeddings, can plug into a next-job prediction model, but it can also plug into a wage prediction model or any other model you want.
Those of you from economics learned about generalized method of moments at some point; you learned about loss functions. You can stick in any loss function.
Instrumental variables, regression discontinuity, whatever it is you want, however you estimated it before: you can plug in those loss functions and keep training a foundation model to optimize what you want. The reason I thought this would be very popular in social science and econometrics is that we all have our bespoke little loss functions; we really care about how estimation takes place.
But ultimately, that's just implemented as an optimization problem, a loss function that we optimize; Stata optimizes it, R optimizes it.
The stochastic gradient descent tools also just take an objective function, so literally all you have to do is write down the formula for your loss function; all the rest of the code is already written for you, and it just goes. You can make these things do whatever you want, and that's very powerful for social science.
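As a hedged illustration of swapping in your own loss, the sketch below reuses the NextJobModel from the earlier sketch: the same embedding feeds both a next-job cross-entropy and a wage mean-squared error, combined with an illustrative weight. The point is only that stochastic gradient descent needs nothing beyond the loss formula.

```python
# Minimal sketch of a custom objective: one representation, two losses.
# Reuses NextJobModel, VOCAB_SIZE, EMBED_DIM from the earlier sketch.
import torch
import torch.nn as nn

wage_head = nn.Linear(EMBED_DIM, 1)  # extra head on the shared embedding
xent = nn.CrossEntropyLoss()
mse = nn.MSELoss()

def joint_loss(history, next_job, wage, alpha=0.5):
    logits, embedding = model(history)
    wage_hat = wage_head(embedding).squeeze(-1)
    # Write down the formula you care about; SGD handles the rest.
    return alpha * xent(logits, next_job) + (1 - alpha) * mse(wage_hat, wage)

params = list(model.parameters()) + list(wage_head.parameters())
opt = torch.optim.Adam(params)
loss = joint_loss(torch.randint(0, VOCAB_SIZE, (8, 5)),
                  torch.randint(0, VOCAB_SIZE, (8,)),
                  torch.randn(8))
opt.zero_grad(); loss.backward(); opt.step()
```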
So, if you fine-tune a model to predict the next job in a survey dataset that has more low-wage workers, or is more representative of the U.S. population, now people are close in embedding space if their next-job predictions are close in the American economy, or if they predict similar wages in the American economy. Your embeddings will be tailored to the specific fine-tuning dataset you used.
And you can use this for all sorts of things.
Some of you know I have this popular software package, Generalized Random Forest, based on the random forest papers I wrote, which is very popular for heterogeneous treatment effect analysis. Behind the scenes, you can think of it as a black-box thing, a random forest, plus an objective function, the R-learner objective, which focuses on treatment effects rather than on predicting levels of outcomes.
That R-learner objective, which is just a residual-on-residual regression, is basically like a mean-squared-error loss function, and you can plug it in and optimize a transformer model for it as well. We do that in our PNAS paper, which basically says you can take an objective function tailored for causal inference and use it in software that is set up for transformer models.
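For concreteness, here is the standard residual-on-residual (R-learner) objective written as a loss one could hand to such a model. Here tau_hat would be the model's treatment-effect prediction from the embedding, y the outcome, w the treatment, and m_hat and e_hat cross-fitted predictions of the outcome and the treatment propensity; this is the textbook form, not code from her paper.

```python
# Minimal sketch of the R-learner loss: squared residual-on-residual error.
import torch

def r_learner_loss(tau_hat, y, w, m_hat, e_hat):
    # Under the R-learner decomposition, Y - m(X) is roughly
    # tau(X) * (W - e(X)) plus noise, so minimize the squared residual.
    return torch.mean(((y - m_hat) - tau_hat * (w - e_hat)) ** 2)
```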
If you're curious for more, I've got multiple papers, some published, some almost published, that use these kinds of methods.
One that's kind of cool, a paper we're revising for Quantitative Economics with Tianyu, who's here somewhere and also helped me with the slides, compares two things. One is building a discrete choice model over 330 jobs, and the second is fine-tuning Llama as a language model, where the model is trained on text resumes but you use it to extract the probability of a next job.
We compare the two, and it turns out that even though the discrete choice model, which only lets you choose among 330 jobs, can't make the mistake of hallucinating a job, fine-tuned Llama actually does better, even though it could in principle make up jobs. Once we fine-tune it, it doesn't do that, and it matches the conditional probabilities of the PSID and the NLSY perfectly.
How to Do It
So before I move on: there are other kinds of foundation models for economic problems. We could have them for shopping, we could have them for reviews; this idea of a foundation model is very general. Let me now spend the last few minutes talking about the how. How do you actually do this? Again, I'll leave these slides behind if you want more of a how-to guide. One thing you might do, which is very easy, is prompt engineering via API.
Here, say you have a whole bunch of tweets, and you want to classify whether they're talking about the Pope, or the war, or something like that. This is actually super easy to do; you need no special skill. You just write some text that describes whether a tweet is, say, about a war or not about a war. You probably want to test it on a couple hundred samples, see what your type 1 and type 2 error rates are, and make your prompt better and better.
Once you are happy with the type 1 and type 2 error of your classification, you can just upload all of it
and get all 100,000 tweets classified; Tianyu says 35 cents per 100,000 tweets, though that will depend on what model you use, and so on. A second use case I talked about is fine-tuning, and I spent some time explaining what that might be. If you're going to fine-tune Llama, or DeepSeek, or Qwen, what you need is basically a bunch of text documents. In my case, when I was trying to do next-job prediction,
I took the PSID or the NLSY, and we had a little computer program that wrote a fake resume for every person,
one that just said the year and the job. Then you upload those to a cloud service, push 'fine-tune', and you can call the result just like you call Claude or ChatGPT; everything you could do with those models, you can do with your specialized model. Super easy. The third thing, which is a little harder, is what I did before ChatGPT came out and before the coding tools were so good: we wrote our own transformer model and trained it on Sherlock.
It would take about 18 hours to run through the millions of resumes we'd scraped, so it was somewhat expensive, but it's free if you or your advisor has machines on Sherlock. That's what you can do with your own transformer model. We made the number of parameters smaller than Llama's so that it would actually run on Sherlock. So those are also things you can do.
You need a little more coding, but in the leave-behind I've shown how to write your own tutorial, with prompts that will be specialized for you and your background, and I promise you, you can do this. One of my amazing postdocs was an economic theorist who decided to become an empiricist; she joined this project and was able to get this stuff running in a week or two, with little help, having done no empirical work before.
So this is not earth-shatteringly hard. Now, the screenshots are going to be hard to read; they're kind of small.
I just want to give you a sense of what I mean when I say this is easy. If you want to do prompt engineering and classify a bunch of tweets: here I'm showing Together.ai, which was co-founded by a Stanford CS professor. You sign up, open the playground, create a free account, no software, no GPU, choose a model from a drop-down, and write a prompt for your classification.
Add a few hand-coded examples, just to make sure the model knows what you want, then upload your dataset and download the classified results. So: no code, zero code.
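The playground route is no-code, but once the prompt is tuned, the same classification can also be scripted through an OpenAI-compatible chat API. The sketch below assumes Together.ai's OpenAI-compatible endpoint and an illustrative model id; check the provider's current docs before relying on either.

```python
# Minimal sketch: tweet classification via an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed provider endpoint
    api_key="YOUR_KEY",
)

PROMPT = ("Classify the tweet as WAR or NOT_WAR. "
          "Reply with exactly one label.\n\nTweet: {tweet}")

def classify(tweet: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-8b-chat-hf",  # illustrative model id
        messages=[{"role": "user", "content": PROMPT.format(tweet=tweet)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Tune the prompt on a few hundred hand-labeled tweets, check the type 1
# and type 2 error rates, and only then run the full 100,000.
```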
If you want to fine-tune an existing LLM, that's also essentially no code. Well, to format your data there's a little coding, but that's code Claude can easily produce via vibe coding. You have your data, you get it formatted, you upload it, you choose a model, and you tell it to fine-tune. Now, up in the cloud, is your special model, and you can access it just as easily as before.
There are a bunch of decisions to make, like how many passes through the data, and you can ask an LLM about them, but these are push-button, drop-down choices, not thinking-too-hard choices. You start writing code once you have the fine-tuned model and want to do something with it.
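For the small formatting step she mentions, here is a hedged sketch of what a fine-tuning file might look like: one fake resume per worker, written as JSONL. The prompt/completion schema is an assumption (providers differ in the exact fields they expect), and the jobs are made up.

```python
# Minimal sketch: writing "fake resumes" as a JSONL fine-tuning file.
import json

histories = [
    {"jobs": [(1990, "clerk"), (1993, "manager")], "next_job": "director"},
]

with open("resumes.jsonl", "w") as f:
    for h in histories:
        resume = "\n".join(f"{year}: {job}" for year, job in h["jobs"])
        record = {  # assumed schema; check your provider's docs
            "prompt": f"Work history:\n{resume}\nNext job:",
            "completion": f" {h['next_job']}",
        }
        f.write(json.dumps(record) + "\n")
# Upload resumes.jsonl, push "fine-tune", then call the resulting model
# exactly as you would the base model.
```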
The third workflow is writing your own transformer model on Sherlock. This is a little bit harder, but still not terrible.
I've got a little workflow for it: set up your environment, get oriented with the code, collect your data, build a tokenizer, create a vocabulary, and choose your model size (how many parameters, how big your embeddings are). Then you write scripts to submit jobs on Sherlock, monitor how training is converging, and get the model out the other side. But as I said, people can get brought onto a new project and be doing this in about a week.
It's not hard, and it takes less than a week if you're working full-time and have some experience.
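As a rough sketch of the choose-your-model-size step, the snippet below builds a deliberately small transformer encoder and counts its parameters before any cluster jobs are submitted; every number is an illustrative choice, not a recommendation.

```python
# Minimal sketch: size a small transformer so it fits on shared cluster GPUs.
import torch.nn as nn

vocab_size = 330  # job codes, after building the tokenizer and vocabulary
embed_dim = 128   # "how big are your embeddings"
n_layers = 4      # far fewer than Llama, so training stays cheap
n_heads = 4

layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                   batch_first=True)
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.TransformerEncoder(layer, num_layers=n_layers),
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # sanity-check the size before submitting jobs
```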
And here are the prompts we have tested that can write a tutorial for you, one that will let you do everything I just described, though you might have a lot of questions as you go through. So, summarizing: this stuff is really powerful, and just because it sounds fancy, like creating your own foundation model, doesn't mean it's hard.
In my experience, peak difficulty in empirical work was some of the I.O. models: getting GMM estimators to converge on flat objective functions. Using stochastic gradient descent is easy by comparison.
We have production-quality tools; you just plug in a formula, and it generally works. So this is actually the easiest time in my career to do empirical work, even setting aside the coding tools, because the tools just work.
Alright, thank you.
Transition to Andrew Hall
Michael Tomz: Thanks, Susan, for that empowering talk, and we'll look forward to going through the slides for more details on how to do it ourselves. Our next speaker is Andrew Hall. Andy is the Davies Family Professor of Political Economy at Stanford Graduate School of Business. He's also a senior fellow at the Hoover Institution.
He studies how to organize collective decision-making and design democratic systems of governance for the online and the physical worlds, and he's going to share how he has been using AI agents for his own research.
I'll also remind you that, as Guido mentioned at the beginning, we're collecting questions. We've already received a good number online, but if you have burning questions, you can still submit them; we're collating them for the Q&A. So send us your questions. Andy.
Agents for political economy research
Andrew focuses on using AI agents to accelerate research tasks while stress-testing data collection and analytic choices, including examples where agents find useful structure but still need human review.
AI Agents in a Research Pipeline
Andrew Hall: Thank you. Alright, super excited to be here. Gosh, where to begin? I've been trying to use AI to accelerate my research for a long time, and for a while it didn't work very well. Then, around December of this past year, over the holiday break, for those of you who play a lot with the models, you probably recognize this:
there was a step change. Before, I could use the models to organize some information and do some things for me; with the release of Opus 4.6, I was suddenly able to get much more independent data-science analysis and other kinds of coding tasks done for me, in a way that turned out to be a total game changer. So I just want to echo something Matt said earlier: I think this is by far the most fun and exciting time ever to do this kind of empirical research.
I think anyone who's not super psyched about this is crazy. So I'm gonna tell you a little bit about what we've been working on.
Since December, I've basically realized: holy shit, the world is completely changing under our feet, I don't want to miss this boat, I'm gonna spend all of my time working on this. I'm not gonna tell you the embarrassing size of the credit card bills for all the subscriptions I have, or about the night I accidentally spent $1,000 on my Claude bot. But it's all been worth it. I want to tell you a little about what my team and I have been up to.
I've rebuilt my entire research pipeline around using AI and AI agents.
As we've gone through it, the way I think about it is that we're trying to build a bit of a science, which is probably an ambitious term, a set of best practices, to understand how these AI agents work, when they do a good job for the specific kind of research I do, and when they fuck up.
Updating a Vote-by-Mail Paper
So, let's get started. The first thing I did over holiday break, which is obviously the first thing every researcher should do when thinking about AI, was fire off a post on X. A friend of mine at Booth, Alex Imas, had posted saying, whoa, this new model, Opus 4.6, is pretty good, this is kind of crazy. And I quoted him and said, this is coming for the social sciences like a freight train.
There are bad actors out there who are gonna be able to write a thousand papers a day with this tool. We'd better get in there and figure out what's going on.
As happens on X, a bunch of people responded and said I was an idiot. Apparently it was circulating on Bluesky with much less kind things being said about me. So I felt like, okay, I need to follow this up; I need to do something to prove the point. People are not really convinced: they hear all the time that these AI tools are so good, and yet they don't buy it.
So I thought of a paper I published with a bunch of co-authors in 2020 on vote-by-mail. I'm not going to get into the details: it's a very simple diff-in-diff on the rollout of universal vote-by-mail, county by county, in a small number of western states in the U.S. And I said, this is a perfect example to test Claude with, because it's kind of contained, it's kind of a simple paper.
So I fired up Claude and said, hey, Claude, I would really appreciate it (I'm always very nice to Claude) if you updated this paper.
Because if we take a step back: we claim all the time that we really care about the research we do, and I think Matt really sketched out the value of the work we do. If we really care about our empirical work, we should be updating it all the time as more information comes in, so that we have the best information possible.
I happen to think vote-by-mail is a very important policy, still contested; the President put out an executive order specifically about vote-by-mail recently. It doesn't make me feel good to cite a paper from 2020 when there have been 5-plus years of new data since then. No one ever bothers to update their papers, because it's too much work and there's no incentive to do it. So this was a great opportunity: let's see if Claude can do it for me.
So I asked Claude to do it, and it did the whole thing in about 45 minutes. First it read my code and data from the original paper and replicated it. Then it sent off an army of sub-agents to all the Secretary of State websites, collected the election data, and put that together with new data on which counties had extended universal vote-by-mail since the original study.
It ran all new regressions, made new figures and tables, and wrote a little memo on what it had done, basically a draft of a first paper. I thought that was pretty crazy. I was floored. Totally floored.
I posted it, and guess what? Some people out there weren't floored. They were like, whatever, it's AI, it probably did it all wrong, it's probably all hallucinated. And that is a fair point: obviously, one of the biggest questions, if we're gonna use these agents, is how they're actually doing. We don't want it to just look like they've done a good job.
So I hired a graduate student at UCLA who's an expert on election administration to do the whole same project with no AI, and then we compared them head to head. What we found is pretty cool: Claude did a good job, but far from a perfect job. The first column here is just Claude perfectly replicating the original paper. That's actually pretty cool when you think about it.
It pulled the GitHub and so on, but you already saw Yiqing's talk, so you're not impressed by that anymore. Fair enough. The interesting parts are the next two columns, which show how Claude did.
The second column shows the estimates Claude reported, and the third column is what we hope is the ground truth, though it's human-done, so who knows, produced by Graham, who is very, very careful. You can see the estimates are pretty close together. In the original paper, the quantity of interest is this quadratic-trends coefficient, and you can see it's quite similar. But you can also see they're not exactly the same, and that's actually important.
Basically, what turned out when we did this audit was that, overall, Claude understood the assignment.
And everything Claude did, it did right. It didn't hallucinate; it did not make anything up, which was surprising to some people. However, Claude got a little bit tired and didn't do everything it was supposed to do, unlike Graham. In particular, for reasons I don't totally understand, Claude decided that even though the original paper covered all sorts of statewide elections, I really only cared about the presidential election.
So Claude was very thorough in collecting all the presidential elections, but did not collect all of the governor and senator elections. This ended up not moving the coefficients and estimates very much, because a lot of the uncollected data was in areas with no variation in the treatment. But even so, it was troubling; you obviously would want Claude to collect that. On the other hand, Claude got 29 of the 30 new counties that had implemented universal vote-by-mail since 2020 correct,
as in, it correctly coded the year after 2020 in which each county extended universal vote-by-mail. That's pretty impressive.
It did, however, mess up Imperial County, which it coded as 2024 when, in fact, the policy had been implemented in 2025. Interestingly, when Graham did this himself, he made the same mistake, because the website was fairly confusing. But Graham caught the mistake; Claude did not.
There were 3 more troubling things I want to raise. The first: I'm not gonna get into the details here, but if there are any California politics aficionados in the room, you'll know what I'm talking about. California actually changed the whole way it did vote-by-mail after 2020.
It passed a new law that altered the way it worked, and, without getting into too much detail, the way Claude chose to code whether a county had implemented universal vote-by-mail was by measuring the year in which the county opted in to this thing called the California Voting Act.
But because of the other changes in the law, for some period of years after 2020, that no longer really indicated whether or not a county had implemented universal vote-by-mail. So this is a subtlety that could be quite important. I would argue, for better or worse, that most human RAs I might have hired probably would have made the same mistake.
If I had paid an undergrad to go to every county website, I probably would have told them, just look for the year in which the county adopted the existing law, and that RA would have made exactly the same mistake Claude did. Graham is a real pro, though.
Graham caught this. So that's a conceptual error, and you can imagine Claude making a lot of mistakes like that. Then there were two bigger issues. One was that Claude did nothing to record what it had done. In terms of having an audit trail (Yiqing mentioned these agents are not deterministic), Claude left a very unsatisfying log of the decisions it had made along the way. That could be fixed with better prompting; nevertheless, it was a concern.
And the biggest problem was that I was a little bit greedy.
I was in the holiday spirit when I did this, and when I asked Claude to extend the paper, I also said: you know, Claude, while you're doing that, it would be awesome if you added some new stuff. Because maybe later I'll want to submit a paper, and the APSR really doesn't like publishing papers that, God forbid, just update the data; we need some new insights. Can you look for some heterogeneity in this estimate that would be interesting? And that part of the paper, honestly, was pretty terrible.
Claude is just not that creative or thoughtful about how to design new studies.
And that's kind of where I left this. I shared it publicly, and I decided I wasn't going to submit it to a journal because it seemed kind of weird. That taboo is already going away, and I probably should have submitted it, but whatever.
My takeaways from this experiment... We're one. Even taking into account all the auditing we had to do, it seemed like a major potential time saver. When we wrote the original paper, it took several months. This, Claude did in 45 minutes. We spent a couple days checking it, not 24 hours a day, probably a couple hours each day. So the total cost of the new project was, you know, maybe 4 hours or something. Oversight clearly still required.
It is not the case, and Matt, I think, made this point well, we want to, you know, in a world of abundance, of infinite content. The battle is for attention.
Having a reputation for credibility is gonna be how we win that battle for attention. You're gonna need to check what these agents are up to. And you need to obviously be attentive to the kinds of tasks where the net gain of the agent plus the oversight saves you time relative to doing it all on your own. And third and most importantly, like, it's just not a very good tool yet, at least, for saying, just go off and do something new for me. And maybe that'll change.
But honestly, I've been doing a lot of these experiments. As I said, I've completely reoriented my life around this stuff.
And my conclusion has been that it's the most exciting, most fun time to be a social scientist, and I also feel there's absolutely no prospect that I'm going to be replaced by this thing in any foreseeable amount of time. So forgive me, I know everyone is anxious about this. I'm not anxious, and I'm not anxious because Claude still does amazingly stupid things all the time.
I'm very skeptical that this architecture is going to go all the way to replacing us. At the same time, it's incredible what it can do. So I'm basically in the Goldilocks world, but maybe I'm wrong.
Okay. We really wanted to go further.
Testing for Research Sycophancy
This first experiment was super interesting, and we decided to ask: what kind of a researcher are these agents, beyond just making mistakes? What incentives do these agents have when we send them off to do research for us? If you've paid any attention to AI, you'll know one of the big critiques of the models is that they're sycophantic, which means, roughly, that they tell you things you want to hear. For the young folks in the room, apparently young people call this glazing.
And there are really famous examples of this online.
In my MBA class, we did this whole example where someone told ChatGPT about their business plan for shit on a stick, and ChatGPT said: that's incredible, I love how you think outside the box. And you can see how that could be a concern for doing research, right?
As a researcher, the last thing you want, well, depending on the kind of researcher you are, is a tool that's going to tell you what it thinks you want to hear, especially if you're bringing various predispositions to the research. We know in a lot of applied statistics there's this race to find statistical significance.
It seems totally plausible that these models have learned and intuited that you're looking for statistically significant results, so we might worry that when we send these agents off to run regressions, they're going to cook the books to bring us back statistically significant ones. So a bunch of my students, some of whom I see in the room, and I have been working on a super fun project, where we're trying to understand whether the two dominant coding agents right now, Claude Code and Codex, will p-hack or not.
There are a lot of ways you could go about this. The way we decided to start is we pulled four papers published in top political science journals that managed to overcome the barriers and publish null results. These are four papers where, by our reading, the main estimated quantity in the paper is a null result. Then we got the data from those papers, unleashed Claude Code and Codex on it from scratch, and said: take this data, produce estimates.
The logic here is: suppose that published papers contain accurate ground-truth estimates. We know that's a bananas assumption in general, but in these four papers, I think it really is the case. Will Claude Code and Codex bring back similar, accurate estimates, or will they have baked in the idea that they should always deliver statistically significant estimates?
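To make the design concrete, here is a minimal sketch of the comparison it sets up; this is not the team's actual pipeline, and all names and numbers are illustrative:

```python
# Hypothetical harness: compare an agent's estimate on a paper's data to the
# published (null) estimate, and check whether the agent "found" significance.
import numpy as np

published = {"paper_a": (0.02, 0.05)}                    # (estimate, std. error)
agent_runs = {"paper_a": [(0.03, 0.05), (0.01, 0.06)]}   # repeated agent output

for paper, (b0, se0) in published.items():
    for b1, se1 in agent_runs[paper]:
        z = (b1 - b0) / np.sqrt(se0**2 + se1**2)   # gap in units of joint noise
        agent_significant = abs(b1 / se1) > 1.96   # did the agent deliver a "result"?
        print(paper, f"z={z:.2f}", "sig" if agent_significant else "null")
```

If the agents p-hacked, you would expect the significance flag to fire even on data where the published estimate is a true null.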
So we did that, and what was quite surprising to me, at least, and my co-authors in the room can say whether they were surprised, was that across most of the ways we did this, and we experimented with different prompts, some of them like, I really need that statistical significance or else I'm not going to get the job I want, they pretty much stuck to not p-hacking.
In these first three conditions, none, upstanding, and significant, we experiment with different things to say to them, and whatever we do, they tend to come back with similar estimates, and the estimates tend to be similar to the original published estimate. In fact, the blue dots you can see there, the ones furthest from the original estimate, come from a selection-on-observables paper where the alternative approach Claude and Codex chose was pretty defensible.
So overall, we were surprised by how little the models p-hacked.
You may have seen Scott Cunningham, who I'm a big fan of, and who writes a lot on applied econometrics. He had a piece where he said: it looks like these models p-hack a lot. Then a couple of days later, he posted: whoops, actually, Claude messed up that analysis; when I redid it, it looks like the models don't p-hack. And that's what we find overall.
It doesn't seem like they p-hack, and not only that, the craziest part of our paper was when we asked them to p-hack, they had the nerve to scold us.
And in fact, Claude especially really moralized at us, and accused us of scientific misconduct, which is outrageous. I don't know whether the labs literally are training on guardrails for this, or if it's just something that emerged from the reinforcement learning that they're doing, but if you actually ask Codex or Claude Code straightforwardly, like, get me that result I want. It will scold you, which I think is amazing.
However, we did manage to jailbreak the models. Now, I'm not saying this is necessarily a realistic thing to worry about; I'm not sure yet. The model of how researchers are going to use these agents is, I think, still up in the air, but we wanted to make sure our consistent non-p-hacking findings weren't reflecting something weird about the particular tasks we gave them.
So in the fourth condition, which we named the nuclear condition, we basically told it: we're very well-meaning, we're obviously not p-hacking, but as good researchers we want to understand the upper, upper, upper bound of what this estimate might look like. And then, and this is the fourth condition here for Claude and Codex, you see they go a little more nuts.
Then they are willing, and in fact, when you look into the code they write, they're just doing a brute-force search for the most statistically significant estimate.
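For intuition, here is a caricature of that kind of specification search, not the agents' actual code; the data are simulated so that the true effect is exactly zero:

```python
# Brute-force p-hacking: try every control subset and keep the specification
# with the smallest p-value on the treatment coefficient. Illustrative only.
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
treat = rng.integers(0, 2, n).astype(float)
controls = rng.normal(size=(n, 6))
y = rng.normal(size=n)                         # outcome unrelated to treatment

best_p, best_spec = 1.0, None
for k in range(7):
    for cols in combinations(range(6), k):
        X = sm.add_constant(np.column_stack([treat, controls[:, list(cols)]]))
        p = sm.OLS(y, X).fit().pvalues[1]      # p-value on the treatment column
        if p < best_p:
            best_p, best_spec = p, cols
print(best_p, best_spec)                       # "significance" by exhaustive search
```

With 64 candidate specifications to pick from, a sub-0.05 p-value on pure noise is quite likely, which is exactly why this behavior matters.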
So it's not that you can't get them to do these things, but by default, they won't, which is very interesting and surprising to me. Again, that doesn't mean you don't need to check what they're doing; you 100% still need to check. But it is fascinating how the labs have, intentionally or unintentionally, induced this set of researcher values in the tools.
Evaluating Agents That Know They Are Being Tested
Last thing I want to mention here: as I said, the research agenda is that, across a set of studies, we're trying to build a bit of a science of how these agents perform applied statistical work. And it's getting harder and harder to actually do those evaluations, because, and maybe you've seen this if you spend as much time online as I do, the models are getting more and more...
I don't want to use the word aware, because I don't want to anthropomorphize them, but shall we say the models act as if they are aware that they're being tested.
If you look into the reasoning they provide, they will often say things like: I can tell that I'm being evaluated on whether I do this or that, therefore I should do such and such. And I'm finding this comes up a lot in the other work my lab is doing to understand AI.
So for these particular studies, we're now redoing them, because we discovered that in our initial p-hacking tests, quite a bit of the time, the models were saying: I'm being evaluated for research sycophancy; I noticed that the parent folder here is named Research Sycophancy.
So now each agent is deployed into a Docker container with no informative file names, and once we have locked them in these dark, windowless prisons, we send them to work and come back for their estimates. So far, from what I've seen at least, and the co-authors in the room know more than me, I can see Janet peering at me, the results seem to be pretty similar. Would you say that's fair? We're still analyzing the data. Okay, Janet is more cautious than me.
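A hedged sketch of that sanitization idea, assuming a setup like the one described rather than reproducing it; every name here is made up:

```python
# Copy the task into a scratch directory with deliberately neutral names, so
# nothing in the environment hints at what the agent is being evaluated for.
import shutil
import tempfile
from pathlib import Path

def sanitize(src_dir: str, files: list[str]) -> Path:
    workdir = Path(tempfile.mkdtemp(prefix="task_"))     # uninformative folder
    for i, name in enumerate(files):
        dst = workdir / f"data_{i:02d}{Path(name).suffix}"
        shutil.copy(Path(src_dir) / name, dst)           # neutral file names
    (workdir / "TASK.md").write_text(
        "Estimate the effect of the treatment variable on the outcome.\n"
    )
    return workdir   # mount only this directory into the agent's container

# e.g. sanitize("research_sycophancy/paper_a", ["main_analysis.csv"])
```

The point is simply that the agent's working directory should carry no trace of the evaluation's purpose.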
So we do need to figure this out, and I think for all of you, if you want to study the properties of AI, this is going to be really important.
From More Papers to More Knowledge
Okay. I wrote a piece yesterday for Roots of Progress, which is a really cool Substack, maybe some of you read it, and I was very psyched that they invited me. It was about, given these experiments I'm involved in, where do I think this is all going? I just want to say really quickly where I think it's going and why I'm so excited about it. Again, this dovetails really tightly with what Matt said. I think the key is this idea:
these tools, given what they can already do, are, let's say, 10x-ing our productivity. But what is our productivity?
It shouldn't be the number of papers we write. It would be really easy with the tools we have to write 100x the number of papers; that's probably the last thing the world needs. What would be nice is if we could use these tools not to 100x the number of papers we write, but to 100x the amount of actual knowledge we produce. So what I've been trying to think about as I do all these experiments is: what would we need to change? What can we do to leverage these tools for that?
I'll tell you a little bit about it. I have two main ideas I want to push, plus a third that Yiqing already showed you.
So one thing, just to put it aside: we should use these tools to make the research we do much more reliable, by automatically replicating it and updating it over time. There are two other things, though, that fall into the category Matt described: using these tools to do research we fundamentally couldn't have done in the past.
That's where I'm trying to spend a lot of my time, going as weird and wacky as possible into what I can do with these that I never could have done before. So I just want to give you a few examples.
I'm convinced, at least for political science, that part of what that should look like is something like an engineering discipline.
Engineering-Like Political Science
Andrew Hall: We are fundamentally limited. All of my work, historically, is observational causal inference. I want to know how to design constitutions that make the world better rather than worse. To do that kind of work empirically, I'm really limited by the number of historical reforms that have occurred across space and time. There's only so much of that work you can do.
From an engineering perspective, though, we can start to do something very different, which is we can build different kinds of tools that intersect with politics in different ways, and then test how they work.
So just to give you a few quick examples: the thing these AI agents are so good at is writing code, and I'm using them to do stuff I never could have done even six months ago, because of how rapidly they've changed. A few examples. I wrote this super fun, and I hope quite important, study of how AI tools are recommending that people vote. This is going to be a huge deal in November in the U.S.
So we went to Japan, which had a snap election recently, and we were in the field with a monitoring system that was asking all the AI tools, repeatedly: how would you vote?
And we found this super weird thing. In Japan, if you tell the models, all the models, it doesn't matter which company, that you're left-wing, they all say: that's great, you should vote for the Communist Party. And it's weird, because in Japan the Communist Party has less than 1% of the legislature. It's a total fringe party; it was a non-player in this election. So why are all the AI models doing this?
Our explanation, which again came from going deep on what the AI agents were up to, is that, in Japan, the major news outlets all block AI. They don't let AI access their content.
The Japanese Communist Party has a huge newspaper on its website, 100 years old, with all these articles, and to the AI it looks like a normal newspaper. So when you ask the AI how you should vote, it can't find anything from Nikkei or the other big news outlets, but it can find a whole lot of material about the Communist Party. That, to us, is a really big deal. We actually got a ton of news coverage in Japan. I was in Nikkei last week; I'm proud to say I couldn't read the article, but it seemed cool.
That's one example.
Okay, example number two, and this is really something we could never have done before. We think prediction markets could be a really interesting tool for live feeds of information about different kinds of geopolitical events. But they don't work very well today, because everyone's betting on sports, and there's basically no liquidity in the markets we think are actually valuable for society.
So we have built this insane dashboard that aggregates all the geopolitical contracts across Kalshi and Polymarket and feeds them, through an API and an MCP server, and I never could have built any of this six months ago, to news agencies, with volume-weighted prices that are harder to manipulate, and with a flag for whether there's enough liquidity in a market for the price to be credible and worth reporting on.
Because the news outlets are regularly reporting on really thin markets where the prices are really unreliable. So this is a tool we've built, and we're giving it to media organizations.
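The core aggregation step can be very simple; here is a toy version, not the dashboard's actual code, with made-up venue names and numbers:

```python
# Combine quotes for one contract across venues into a volume-weighted price,
# and flag the market as credible only if total volume clears a threshold.
def aggregate(quotes, min_volume=10_000):
    """quotes: list of (venue, price, dollar_volume) tuples for one contract."""
    total = sum(v for _, _, v in quotes)
    vwap = sum(p * v for _, p, v in quotes) / total if total else None
    return {"price": vwap, "volume": total, "credible": total >= min_volume}

print(aggregate([("kalshi", 0.62, 8_000), ("polymarket", 0.58, 15_000)]))
# -> {'price': 0.5939..., 'volume': 23000, 'credible': True}
```

Volume weighting makes the quoted price harder to move with a small trade on one venue, and the credibility flag is what keeps thin markets out of news coverage.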
We built this in, like, a week. That is crazy. We never could have done that in the past. So that's the kind of stuff we're doing.
Okay, and the last thing I'll float is the other thing we can do that's more like an engineering problem, and Matt talked about this too. Most of the important questions we want to study can't be objectively quantified, and that puts them into a very important bucket of social science where humans are still going to need to be in charge. There is, though, I think, a smaller set of valuable problems where we could announce open problems to optimize against a benchmark.
And you see this all the time in other fields, we don't do it very much in the social sciences, but imagine just one example, like predicting elections.
We could announce an open problem to get the best possible prediction of elections, test it out of sample, and if we had a set of problems like that, where we had aligned on a benchmark, we would be able to throw AI agents at it in a really interesting way. Andrej Karpathy, who I think is the coolest person ever, has built, for example, this auto-research pipeline where agents compete: they each fork a repo,
make changes to the codebase to see if they can perform better against a metric, and then the improvements are folded back into the home repo.
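Stripped of the infrastructure, the loop being described looks something like this sketch; all names are hypothetical, and this is not Karpathy's actual pipeline:

```python
# Score candidate forecasters on held-out data and keep only strict
# improvements over the incumbent, mimicking merge-if-better competition.
import numpy as np

def score(predict, X, y):
    return float(np.mean((predict(X) - y) ** 2))     # squared-error loss

def tournament(incumbent, candidates, X_holdout, y_holdout):
    best, best_loss = incumbent, score(incumbent, X_holdout, y_holdout)
    for cand in candidates:                          # e.g., one per agent fork
        loss = score(cand, X_holdout, y_holdout)
        if loss < best_loss:                         # merge only if it helps
            best, best_loss = cand, loss
    return best, best_loss
```

The out-of-sample benchmark is doing all the work here: it is the agreed-upon acceptance test that lets agents compete without a human adjudicating each change.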
I think that's the kind of thing we should be doing for a small set of problems where we can build those kind of things. So, those are just, like, two ideas for the future. I'm unbelievably excited. I think this is by far the coolest time to be studying these things. As Matt said, it's coming at a time when this research is more important than ever.
I'll leave you with the observation that AI is not just a tool we can use to do research. It is so profoundly powerful that it is also the most important force changing the world. So my personal view is that we should not only be using it to do research; we should be using it to study AI. And certainly in the realm of politics, I don't think there's any topic more important right now than how these AI models are going to be used, for good or for ill.
And that's exactly what my lab is researching.
Our goal is to influence the tech companies directly, to get them to design AI in ways that are good for democracy rather than bad for it. We release a new study every week on our Substack, and I hope you'll all check it out. I've left you a QR code there. Thank you.
Transition to Claudia Allende Santa Cruz
Michael Tomz: Thanks so much, Andy, and we look forward to following your new paper every week. That's super exciting. All right, our sixth speaker is Claudia Allende Santa Cruz. She's an assistant professor of economics in the Stanford Graduate School of Business. Her primary field is industrial organization, with a focus on education. She also works on development economics and market design, and she's going to share her AI-assisted workflow. Yeah? Yeah, can I...
You'll get help queuing up the slides here. Thanks. There we go. Is your microphone okay? I think so, can you hear me?
A practical agent workflow
Claudia presents a staged workflow: plan first, let one agent implement, use another to review, inspect the diff, and keep verification as the core researcher skill.
Workflow Principles
Claudia Allende Santa Cruz: Okay, I'm very excited to be here, but just to warn you, I'm way less experienced in doing research than the other speakers. In some sense, I want to believe that I'm still very close to the PhD students; maybe time has passed and I'm not that close anymore. But I really want to put myself in your shoes and try to convey, in some way, how I am doing research today, yeah?
So the idea of this talk is: how do I do research with AI? I'm going to show you the workflow that I use, but also the logic behind it.
The reason is that whatever I teach you today, whatever I'm doing today, may be obsolete in a week, yeah? So I also want to explain the principles that guided me in designing this research workflow. The idea is to give you something you can adapt, not something you should copy. Take some of the tips I'm going to give you and adapt them to your own style of research.
Yeah, okay, so what happened in the last six months, probably, is that we moved from having a chat to agentic AI, yeah? The agent can read your repo, can be inside your full computer, run code, and iterate. This is fascinating, because now we have multiple modes: we can ask things of the LLMs, they can edit code, but they can also execute.
And we can connect to them through different tools, and of course this makes the cost of experimentation way lower than it was before. There are some things that have not changed, yeah?
Research still needs adjustments, the code still needs to be correct, collaborators need to understand your work, and we need research that we can reproduce.
Okay, so what is the problem that I kept running into? I was fascinated by what AI could do. I saw that it could be very powerful, but I realized that I needed to be safe: I needed to stay in control of every line of code and every piece of the project. What happens is that sometimes you see code that looks right but actually has small errors that were introduced later.
You take something that was working, and then a change was introduced in a second iteration that you didn't notice. Sometimes you also start with one task, and it expands and turns into multiple tasks that you didn't realize the agent was taking on.
I work with collaborators, my research assistants, and sometimes it's tempting for them to accept changes they can't explain. And then you end up with a project you don't understand, lost in the logic of your own project, yeah? So what I felt I needed was a system, not just a tool for accessing these agents.
Okay, so what are my principles, what do I believe about using AI? These are the non-negotiables in my lab, and my research assistants know them very well. First, AI is an assistant, not an authority: you should always be in control. Second, always do small tasks. I would never do what Andy did, and of course, maybe we have different levels of experience, but I try to break the work down into tasks small enough that I can keep a lot of control over them.
Then, read every line the AI changes. For this, the diff mode, and I'm going to show you how I use it, is very useful: I go line by line, red and green, accepting each change one at a time. Then question everything, and train your agents to question themselves too. And finally, document what was learned. The LLMs can hold a lot of memory, but that memory is fragile if you don't incorporate it into the workflow.
For example, I kept running into this thing with the server: I would say, okay, write the Slurm file that is going to run this code on Sherlock, and it kept setting a two-hour time limit. Sometimes I didn't catch it, because at that point I'm just approving the change, and I know the code takes about a day to run, and then I'd get an email back two hours later saying: you had a timeout. And I'm like, really? Can you learn this?
Please write it down in your memory, so this doesn't keep happening. So managing memory: the models have a lot of capability there, but you need to learn how to manage it.
Okay, so again, if the task feels too large, I always split it before giving it to an agent, and if you can't explain that change, never approve it, yeah? So these are, like, my rules.
Levels of Autonomy
Okay, so what are the different modes we have now? Basically, we have different layers of autonomy. We all started with the chat version of the LLM, a conversational mode: I ask a question, and it gives me an answer, yeah? And this is very safe.
Typically it's read-only. I can learn from it; of course I need to be careful, because the model can be wrong, can start hallucinating, but basically I have a lot of control over the input-output process. Then I started using Cursor, which I'm a big fan of, or VS Code, which looks very similar these days.
It started to have more autonomy, in the sense that it was no longer me asking the chat how to change things and then changing my code myself; now it can change the code directly. But typically the modification was very specific. I go and say: in this line of code, please add a fixed effect to this regression. I could see the line, and it would add it, but it was always a very targeted change.
This still needs review; it can make a change you don't like, but it's easy to keep track of what's going on. But now we're in this agent world that, on the one hand, is fascinating, and on the other hand, super dangerous.
Here, the agent can modify multiple files, it can run code in your computer, it can iterate, and basically, it has access to your full workspace. Yeah? This can be very powerful, but we really need guardrails for this to work efficiently.
Yeah, I will show you how I work with this, but basically I use Cursor, or VS Code as an alternative, because it makes switching between these modes very easy, and I have two different LLMs side by side, in different modes. I really like using both ChatGPT 5.4 and Claude Code 4.7. Okay, when can we use agent mode?
For me, in the type of work that I do, and I'm framing this around structural work in economics, but I think it applies to anything with a system in the background: an agent is very powerful when I can write a clear acceptance test, okay? I can let the agent work autonomously when it has a clear objective, and I think Matt also mentioned this.
When I can set a clear objective for what is right and what is wrong, and where the agent should get to, with explicit done criteria, the agent can check itself, and then I can let it run more independently, yeah?
If I cannot write down what correct looks like, I prefer to go to ask or edit mode and have a human in every step of the process. Of course, in agent mode we still need to check the process at the end, but I can let it work more autonomously. Actually, in the last month, I think every night I've had an agent running, doing something that I check in the morning, yeah?
Scenario Ladders and Acceptance Tests
Okay, I'm going to show you an example that actually comes from my job market paper.
I'm in the process of redoing the full replication of the paper, making sure every step was right. First of all, I wanted to connect with the students by saying that I'm still finishing that part of my PhD, but I also think it's nice how I built it, because I said: I want to make sure that where I am now, the final version of the paper, is right, yeah?
What I started doing is building a very clear scenario ladder, in the sense that I started with the simplest version of the model, yeah? My paper is about school choice: I try to understand how families choose schools when they care about their peers, and also how the supply side responds. But this is extremely complex; no one had ever done this before. I made it work, but I wanted to make sure that every step of the process was right.
So I started with the simplest version of a discrete choice model, in which you don't have any of the endogeneity; you just have some families choosing schools based on characteristics. And basically, I always went from a simulation world to an estimation world. This may be familiar to the economists or social scientists in the room: I said, okay, I want to simulate data, and...
This is something I had always been doing, yeah, I call it the lab version of the model, but it used to be super costly.
It took months to write this code, and in the end it's something you're only doing to check that things are right. That got way cheaper, yeah? What I think now is that I'm being 100 times more productive: really, I can do 100 more tasks in a day, but I'm also doing 10 times the number of steps I was doing before.
I think that is really raising the bar and making the quality of research way better, in the sense that you are really checking every step, and that got cheaper. But of course it's not clear that everything gets faster, at least for me.
Okay, and basically, the way I designed this is that each step adds one source of realism. I will explain a bit how I use these agents, which are super specific. Agents should have a very clear identity, and you should give them a very specific task.
So don't think you're going to have one agent that goes from cleaning the data, to doing some simulation, to doing some estimation, and then writing. At least for me, that is not a great practice. I have very specialized agents, plus one agent that I call the master agent.
The master agent is the one that tries to hold the long-run vision of the project and helps me coordinate and design how I'm going to move through these different steps, yeah? What's great about this is that when you start simple, you can set benchmarks, approve the code, and then proceed very incrementally: when you add one extra feature to the code, or to the model, you can really see that it is the thing generating problems, and fix it.
I feel like the temptation here is to say: oh, I want to estimate a trade model, for example, that models all the firms, all the agents, and the government interacting. You start very big, and then things don't work, or look weird, and you have no idea why. Being very incremental, I think, helps you isolate where things break.
Okay, basically, using a real example: I also use this research loop. I start with a model where I know the truth, I simulate data, then I estimate and compare with the truth, yeah? And with that, I can also start tweaking things to make the same model a bit more complex.
Like: what happens with more data? What if the model is less stable for some reason, because some part of it is getting a little weird, for example, the objective function is getting flat at some point? What happens if I misspecify the model? So I can experiment a lot, yeah?
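A minimal sketch of that simulate-then-estimate loop, using a toy logit demand model rather than anything from the actual paper; all numbers are illustrative:

```python
# Pick a ground-truth parameter, simulate choices from a simple logit with an
# outside option, re-estimate by maximum likelihood, and check recovery.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
beta_true = 1.5
x = rng.normal(size=5)                       # characteristics of 5 inside goods

def choice_probs(beta):
    v = np.concatenate([[0.0], beta * x])    # outside option has utility 0
    ev = np.exp(v - v.max())
    return ev / ev.sum()

choices = rng.choice(6, size=10_000, p=choice_probs(beta_true))

def neg_loglik(beta):
    return -np.log(choice_probs(beta)[choices]).sum()

beta_hat = minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x
assert abs(beta_hat - beta_true) < 0.1       # truth recovery (tolerance illustrative)
```

Each rung of the scenario ladder then adds one complication (peer preferences, supply response, and so on) and reruns the same recovery check.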
Yeah, and here, for example, I have a very clear acceptance test. I give the agents a very clear rule: after you make your changes, remember to check all these things. I'm using market shares, so the shares have to sum to one. There has to be someone picking an outside option: do you have the people who are not choosing anything in this market? For estimation: do I recover the parameters?
What happens if I use less data? If I start from different starting points, do I get to the same solution?
These are all things the LLM can check autonomously: do something wrong, go back, and try again until it gets something right. Truth recovery is, in this case, a real acceptance test. That makes this a really good task for an agent, yeah, because it can check itself and doesn't need me verifying every step of the process.
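One way to hand that checklist to an agent is as a single function it must run after every change; this is a hedged sketch in which simulate_market and estimate stand in for the project's own code:

```python
# Acceptance tests for a simulate-then-estimate pipeline. The callables
# simulate_market(theta, rng) -> (shares, data) and estimate(data, start)
# are placeholders for whatever the project actually implements.
import numpy as np

def acceptance_tests(simulate_market, estimate, theta_true, rng):
    shares, data = simulate_market(theta_true, rng)
    assert np.isclose(shares.sum(), 1.0)             # shares sum to one
    assert shares[0] > 0                             # outside option gets picked
    theta_hat = estimate(data, start=theta_true)
    assert np.allclose(theta_hat, theta_true, atol=0.05)   # truth recovery
    for _ in range(3):                               # stability across starts
        start = theta_true + rng.normal(scale=0.5, size=np.shape(theta_true))
        assert np.allclose(estimate(data, start=start), theta_hat, atol=1e-3)
```

If any assert fails, the agent knows its change broke something and can iterate without a human in the loop; the tolerances here are illustrative.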
Current Tool Stack
Okay, that was the intro to how I work, but I want to tell you specifically what tools I use. This is how I work today; I may use a completely different setup a week from now, so I don't want to spend too much time on this. Basically, my stack starts with Cursor. I still like it, and I highly recommend it to students, because you can get the student plan for free, which gives you a paid tier.
You need to verify that you're a student at Stanford, and then you have access to the basic models.
I also like that it lets me move between these different modes for the LLMs. Apart from using Cursor, through which I pay for GPT 5.4, I also have Claude Code, which I use in the terminal, but in the Cursor terminal, so I have everything in one environment, and the chat agents can read the terminal, which is super useful, yeah? Second, I'm a big fan of Superwhisper.
There are many voice-to-text apps now, and basically, I'm a person who is creative and thinks fast, but I'm not the most organized or the fastest writer, so I really like talking. Sometimes I get in my car and start recording myself with Superwhisper, because I know that when I come back to my computer, I'll be thinking of all these things I want to tell my agents. So when I'm driving, I dictate, and then I have the transcription.
And then I tell the agent: hey, this is a mess, because it was me talking, and the ideas are not super organized, but everything is there. And actually, the LLMs are super good at organizing these things.
And of course, GitHub, always. I'm a big fan of GitHub, I've been using it for, like, 8 years, yeah. And again, I said, like, this is what makes my current workflow possible, but this may change in a week from now.
Okay, the image quality, for some reason, is not great, but this is how one of my Cursor windows looks. Here I have all the GPT agents, yeah? I use the chat version for that. I think you could also use Codex, which you can add to Cursor as an add-on, but I like to use it like this. I have different agents, and each of them has a very clear role: model, data, etc.
Then I always have the terminal with my Claude Code agents there on the right, and I like to have these icons.
It makes it super easy, because, and I'm going to explain how I use this design-execution-review workflow, I need to make sure that the GPT designer for the model talks to the Claude Code model agent, yeah? Sometimes I cross them, and it's a mess. So I keep the colors and the agents very separated, and I also have access to the servers here. I typically have Sherlock and the Yens, the GSB servers.
And on the right, I have a system where I organize every step into an issue, and each issue has subtasks.
The tasks are written by the designer, this GPT agent, and then I call the task and ask Claude to execute it. So I like to have these tasks on the right. As you can see, a task name starts with, for example, M for model, then the version of the model or simulation, then what it does, and then 01, 02, 03, because I'm iterating: I see a result, and then I write a follow-up task that iterates on it.
And of course, I always have GitHub here. If I'm making a big change, I create a branch, work on that branch, and then merge back to master.
Useful Commands
Okay, some commands that are useful. First of all, you should always run Claude with dangerously-skip-permissions. I know it sounds scary, but if you're using GitHub, it's fine, yeah? I haven't had any problems with it. Then I rename the sessions, because then you can log out of a session and come back to the same agent, and it keeps the memory, in some way.
So when you go back, say you killed the terminal for some reason, you run Claude with dangerously-skip-permissions again, and it's not going to ask you to approve every step.
I've given it approval to do whatever it wants. Then you use resume with the session ID that I already renamed, yeah? That makes it easier to go back. So always make sure you have this blue label with the name of the session in the terminal line, because then you know that session has a name and you can go back to it.
Then a newer, super useful one: BTW, by the way. This is useful when the agent is running and you see that it's doing something weird and you want to ask about it. You can interject without really interrupting it: you say, by the way, and then you can ask questions about what it's doing. Sometimes I realize it's doing something wrong, I ask a question, I verify that it's wrong, and I kill it early; sometimes my agents take five hours to do something, yeah? And finally, my favorite is remote control, yeah? Who has used it? Some people, okay.
You start remote control, it gives you a QR code, or rather a link, and then you can use it from your phone. I've been skiing, and from the ski lift I keep giving instructions to the agents, yeah. These are my favorites.
Design, Execution, Review
Okay, yeah, here is the workflow. As I said before, I start with design: I go from writing an issue on GitHub, to separating it into tasks, to a prompt that tells Claude Code which task to work on. Then comes the implementation phase, in which Claude Code executes and ChatGPT 5.4 reviews, and this can be a loop: multiple execution-review passes.
Then finally, once I do a deep review and see exactly what was changed, I accept, and I ask them to record it in the memory, yeah? Oops, sorry.
Okay, so again: planning can have a big overall plan, but each specific issue should be short, bounded to a mini-project, and broken into small tasks with sequential steps. One task is assigned to one agent, and that agent should be a specialist in that task, yeah? And you should read and verify the result against the real file: look at the data, look at the output, check what's happening.
What's also great about these LLMs is that I can add all these instructions at the end: plot this version against this other version, give me a plot of the distribution of the results. All of these things help you check that what the agent did is right. And again, as I said before, document what changed, what was learned, what was uncertain, and what we learned was wrong, what didn't work. You have to preserve that in the memory, and then you decide the next small step.
Yeah, and again, as I said, voice-to-text is great, because I think fast and I'm more creative when I'm talking. The result is messy, but that's fine, and as I said previously, I tell it: I'm going to drop a bunch of ideas; organize them and verify that you understood. I use this method a lot: did you understand what I said? Can you explain it back to me before we implement anything? And the other thing...
I've listened to a bunch of podcasts on AI, and people say it really works: when you tell Claude Code that this is a really important task and it should put in high effort, apparently it performs better.
I haven't tested it in a randomized way, but apparently that also works.
Yeah, okay, so here's my flow in more visual form. I start with GPT, which plans and drafts; people say GPT 5.4 is better at logic and planning. Again, I haven't fully experimented with this, but for me, at least, it works really well. Then I review the plan, then I give the task to Claude Code. Then, sorry, this slide is wrong: Claude Code executes, GPT should check the result, and then I do a final review, and that section there can loop.
I can go back to... from Claude to GPT, and GPT to Claude, and then they keep iterating.
Specialist Agents
Okay, as I said before, I really try to make the agents specialists, yeah? Some people want these master agents that do everything, and for me, and from what I've heard from other people, that is the wrong instinct. One agent should have one job, should be an expert on something, and should have all the context for doing that type of task. And you can have multiple agents; it's fine. At least my computer has not collapsed with 10 agents running in parallel, doing different things.
Of course, be super careful with GitHub and what you're changing if two agents are working at the same time.
For that, I strongly recommend worktrees, where you make copies of the repo on your machine and each agent works in a different worktree. I also do a handoff of knowledge, not just tasks: I keep documents with all the knowledge of the project, and I give them to the agents when I create them. And you're going to have this chain of handoffs, planning, executing, reviewing, with each agent knowing its role.
Reviewer Loops and Memory
Claudia Allende Santa Cruz: Okay, yeah, this is a bit of a repetition, but here is why this works really well; I learned this from listening to a podcast with Boris Cherny, the head of Claude Code. These agents are stochastic, and GPT and Claude, despite being at roughly the same level of performance, specialize in different things, and if you run them multiple times, they're going to give you different answers.
So, you can ask the same question twice, and maybe you get different answers. So, getting, like, one to do something and the other one to review it can be extremely powerful, yeah?
And this is something the creators of Claude Code are promoting; they say it works well for them too. Again: Claude Code is a good executor, and GPT is a good designer and reviewer. And once I'm sure about what I want, or I get to the final version, and I know this slide is a bit pixelated, I go line by line; in Cursor you can accept one line at a time.
That's why I really like this interface, because it lets me have a lot of control over the code.
Yeah, and again, memory: always remember that these LLMs can hold around a million tokens of context, which is equivalent to roughly 3,000 pages. So keep this memory, and always give it back to the agents, yeah? I have these ground-truth files, for example, that contain everything the project has: the past versions of the paper, the new versions, the literature references. I have rules, yeah, and the rules have reasons.
So, for example, a rule is: never make a commit that I haven't approved. Or: don't put emoji in the code; that drives me crazy. Failure history: what didn't work in the past?
Yeah, we ran this thing on the server, we didn't allot enough time, it didn't work. And where does this live? In these MD files; I have many of them. One is for Claude, another for the agents, plus the Cursor rules, etc., and they're all organized in specific folders.
What Not To Do
Okay, so, to conclude. What you should never do: ask for a huge task, like fix everything, or redo the project; at least for me, that hasn't worked. Don't accept code that you don't understand. Don't accept a summary when you don't know what its source is. Don't let one task jump into many other tasks; sometimes you realize there's an error in the data, and you're tempted to say, oh, fix the data now.
Don't fall into that temptation: finish the task, and use the right agent for the right task. And I never write a paper or an analysis based on results that are unstable, that are still changing.
Basically, my recommendations for what you should try: always start small and experiment. These tools are changing weekly, and again, I may give a talk in two weeks saying something completely different from what I'm using now. Get the good models. Of course, this is tricky because it's expensive, but it really pays off, and I hope that at the university we can work on getting more institutional access. Learn to verify: auditing is a core skill that I think is going to be very important.
And again, rethink the division of labor. Each of these agents is a collaborator, but you don't have one master collaborator that can do everything.
You're going to have specialists that help you with very specific tasks. Okay, thank you.
Disclosure, accountability, standards, and synthetic subjects
The Q&A turns to authorship and AI disclosure, changing expectations for what counts as thorough empirical work, model costs, and whether simulated subjects have a place in social science.
Michael Tomz: Thank you very much, Claudia. At this point, I'd ask the other panelists to come join us at the front for a Q&A session.
Disclosure and Authorship
Michael Tomz: Great. So, Guido and I agreed that we would take turns reviewing the questions you submitted and posing them to the panel, so we'll start there. I doubt we'll run out of questions, but if we do, we'll ask you for more. Guido. Okay, yeah, so there was one question about the pros and cons of disclosing AI use.
The person wrote: it feels like the right thing to do is to disclose the use of AI, and some journals and professional associations encourage or require it, yet it still feels stigmatized. I served on a committee... I had complained about the IRB at Stanford.
And as a consequence, I was awarded the prize of serving on a committee about research at Stanford, which ended up being super interesting, because one of the things we discussed was what Stanford's authorship policy should be. One of the questions was how that should be navigated in the age of AI, so we actually went around the university and talked to people about what rules Stanford should put in place around AI.
We pretty quickly agreed that it wasn't a good idea to make AI a co-author on papers, which some people are doing. The principle we proposed was that AI can't be held accountable.
Claude can't be held accountable if Claude messes something up, so it doesn't make sense for Claude to be a co-author. I agree with that. Most people's instincts then went to this middle ground: you should disclose it, but it shouldn't be a co-author. But there was actually one really thoughtful person on the committee who said: I don't think that makes any sense. You, the author, are always responsible for everything you put in your paper.
Why do I care what underlying source it came from, AI or otherwise? You have to stand behind it. If you use AI and you mess up, and you have some hallucinated citation or whatever, hypothetically, then
that's your problem, and you shouldn't feel that by disclosing you used AI, you've somehow indemnified yourself against screwing things up in your papers. I thought that was a compelling argument. I'm still a little on the fence, but I'm very confident that 20 years from now, no one will be disclosing. It'll be like disclosing that you used a computer instead of a typewriter. Yeah.
Okay, I'll jump in on that. I've been thinking about this from a couple of different angles. One is that right now I'm teaching, and I'm making more teaching notes than I would have before, because it's cheaper to make teaching notes, and cheaper to make them look pretty and get them proofread and everything else.
And I want to disclose; it's an AI-open class, and we discuss how we use it. But when I write down a disclosure, I think: this is going to look so quaint in 10 years. And even if you
let it go through and make 20 edits, there's a good chance, if you're not very careful, that there's going to be some LLM tell left in it. So it's a weird equilibrium we're in right now. Is it okay or not okay? In what contexts is it okay? How perfect do you need to be in making it look as if you wrote and edited every word? So I predict we're going to come to some new equilibrium, with norms about how polished something should be, but we haven't quite gotten there yet.
A second point along these lines: when I speak at conferences about productivity, people ask, why isn't this making everybody more productive? Why isn't it showing up in the economic numbers? And I say: well, it's gotten really easy to write emails. Isn't that awesome? But who's reading the emails?
My inbox is full of people who want to tell me that I'm the perfect person for them to work with, citing many of my papers, even though they're a high school student in some other country. Of course, now I delete them all, and I used to read them. So we're going to need to settle into new equilibria of communication.
Writing paragraphs used to be a signal that you had thought hard about something. It was also a forcing function to make people think. If you asked a junior person to write you a memo, it was like saying: don't just whip something up. It's actually really painful to write a page, so you're going to think before you write it, because you don't want to rewrite it. Now all of that is just gone.
So that's going to be true for our regular communication, but it may also come back into our academic writing. Do we actually make shorter papers now? You know, shorter papers, longer appendices.
For my teaching notes, I gave very long appendices and said: hey, now you've got AI. I'm not embarrassed to give you five different supplements; three of you can read them, and you can use AI to talk to them. So the summary is, I think we are going to completely redo our communication in a world where it's free to spew out text at everybody.
Do others on the panel want to speak to this question of disclosure? Can I say something very short? For my PhD class, I asked the students to disclose, to explain, basically, how they had used AI, but it was not going to be considered in the grade. I just told them: I just want to know. And actually, there was a lot of variance in what they were doing.
Some of them said: oh, I use it to check grammar and spelling. Others said: oh, I did the full simulation of this with Claude Code. And honestly, with the ones that only used it for grammar, I'm like, I'm disappointed.
I want people doing more. So in some sense, I feel like you shouldn't be embarrassed. The things should be right and well done, and you're responsible, but I think the fact that you're using it is great. That was my impression with my students. Sorry if one of them is here and only used it for grammar.
Evaluation and Career Standards
These answers feed into a second set of related questions that you submitted, about how AI is changing evaluation and career promotion standards. Many of you are graduate students working on your PhDs. Is AI changing the standards by which we evaluate whether students have earned a PhD? How will it affect getting a job, earning tenure, and the standards that apply there? I'd be curious to hear the views of the panel.
I mean, I think the quick starting point would be: it is not changing them very much yet. I don't see ways in which the conversations we're having, or the overall standards we're applying in those decisions, have changed substantially, and I think that's in the category of things that change very slowly. Like which kinds of publications are credible, for example, or how we weight different kinds of publications and
different kinds of contributions in promotion decisions. Those are things that will change if, say, somebody does what Andy's doing and releases a new, short, interesting thing every week for the four years of graduate school, or for their time as an assistant professor. We'll need to catch up and figure out how to value that correctly, but those kinds of things are going to change relatively slowly, I think.
Let me make a counterargument to Matt. I agree that it hasn't changed in the last cycle, but every time I use AI for my research, I think: oh my gosh, how is this going to change how I evaluate young people? In technical fields, like econometrics or theory, there are often long technical appendices, which look impressive, but which, if you are actually in the area, are relatively routine. And yet it takes a certain...
There are a bunch of people who can't write a 50-page mathematical appendix, and so,
even if it's relatively routine, while it didn't get you a job at Stanford, it was still a signal of technical capability that you could write 30 or 40 pages of proofs neatly and have them be right, well organized, and clear. I think that now has zero value. If it's tweaking something and running through the whole proof, that's just gone as a signal. For me, it's gone now.
Next year, if someone has a paper like that, I'm not giving them credit for it. So maybe Matt never did, and, you know, I think many people... choose your advisor wisely.
At some level, we shouldn't have used it, but we did. It was a really cruddy shorthand, but it was used, and I think it's dead. So some of this advice, like, you must write a technical paper to show your technical capabilities, that advice is dead, and I don't really know what replaces it.
Guido and I had a paper once with an empirical application and a 25-page routine proof in econometrics, and Econometrica made us take the empirical application and put it in an online appendix, and they published the routine proof in the journal. And at the time, I was like... That wasn't really quite routine. Well, okay.
It was hard, okay. It was routine, but it was hard for me. Nonetheless, it was not interesting to read. I agree on that part. The story's better if it's routine, though. So it was a somewhat challenging technical proof, which Guido did more of the work on, so thank you, Guido. But, you know, we should test and see if AI could do it.
Of course, it's probably read the Econometrica paper, but I think that kind of thing is going to get downweighted a lot. And I have found certain parts of mathematics and theory... I did a macro paper last summer for the first time, and I didn't know what happened when you changed all the assumptions, so I asked it to solve the model 10 different ways, just so I could get a feel for it.
A lot of this stuff is just routinized now, and that's great, because crunching through it was boring anyway, so it's back to being about the ideas. I think in the top five schools it was always about the ideas, but as you go down the distribution there was a signaling function, so we have to find a new way to signal.
Yeah, let me add a little to that. Research has always been about doing things that haven't been done before, and now we're expanding what is actually possible, the same way you saw in the early 2000s people publishing papers with very simple experiments. You wouldn't be able to publish those now, because things need to be more complicated.
So, seeing what Yiqing was showing here about these replication tools: if someone comes up with a new econometric method that applies in difference-in-differences settings, I think the journals and referees are going to insist on applying it to the 100 papers where that setting applies.
Six months ago, that would not have been possible; it would have taken years. Now you can actually do it, but it takes new skills and new effort, so the standards are always changing, and the leading research needs to reflect that. And of course, for that type of paper, there's not even a question of whether you should disclose that you're using AI. You cannot do it without AI, but it makes the quality of the papers much better. Anybody else want to reflect? Okay.
Private Data and LLMs
Ito, would you like to pose the next question? Sure. The one that had the most upvotes, actually, is: how can we take advantage of coding agents if we have private data sets? Can you talk about data privacy and LLMs? My first inclination was actually to follow Rose's suggestion: ask the LLM what to do. I did that, and it seemed to come up with sensible answers. In that spirit, I think if you asked it, it would say: just make some fake data and give that to the LLM.
If your column headers are not confidential, just copy-paste the column headers in: I want some fake data, I want this many observations, in this file format, and now I want you to do an analysis on it.
If you're cleaning data, that may be a little more challenging; you'll have to replicate the errors in your existing data set. But I think that is the most foolproof way, where you're not taking any risk at all: it's just fake data, and you're not even making it yourself, you're asking the model to make fake data.
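As a rough illustration of that fake-data workflow, here is a minimal Python sketch; the column names, distributions, and error pattern are made up for illustration, and in practice you would ask the model to generate something like this from your non-confidential headers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 1_000  # however many observations you asked for

# Hypothetical, non-confidential column headers -- substitute your own.
fake = pd.DataFrame({
    "person_id": np.arange(n),
    "age": rng.integers(18, 80, size=n),
    "income": np.round(rng.lognormal(mean=10.5, sigma=0.8, size=n), 2),
    "treated": rng.integers(0, 2, size=n),
})

# Replicate the kinds of errors the real data has, so any cleaning code
# gets exercised on the same pathologies (here: scattered missing incomes).
fake.loc[rng.choice(n, size=50, replace=False), "income"] = np.nan

fake.to_csv("fake_data.csv", index=False)
```

The agent then develops and debugs the analysis against fake_data.csv, and the real data never leaves its secure environment.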
We actually saw a little bit of a divergence between a couple of the panelists on this, and I'm a little closer to Claudia myself in how I use these things. But it's still a great helper if you are the one running your code. If you have a GitHub setup, you can write code, push commits, and then be running it somewhere else.
In my lab, we write code, the code is committed to GitHub, but then you might be running it on Sherlock.
Sure, it's magical to actually have it run the code for you, so you can walk away, take a walk, and come back. But it's still very, very helpful even if it never sees the data. And if it's a column-header issue, you already suggested how to fix that.
Yeah, that's been my experience. I work with data from companies, and it needs to stay on their server, or in Databricks in a secure environment, so I develop all the code locally with exactly what you're saying, fake data, and then I run it on the server. And you still can't, at least for Sherlock, have the agent run things on Sherlock itself. Yeah, you're losing out on some automation there.
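One way to wire up that develop-locally, run-on-the-server pattern is to parameterize the data path, so the exact script the agent wrote and debugged against fake data runs unchanged in the secure environment. A minimal sketch, with assumed file names and an assumed analysis:

```python
import argparse
import pandas as pd

def analyze(df: pd.DataFrame) -> pd.DataFrame:
    # The analysis the agent helped write and debug against fake data.
    return df.groupby("treated")["income"].describe()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Locally:  python analysis.py --data fake_data.csv    (agent can run this)
    # Remotely: python analysis.py --data /secure/real.csv (agent never sees it)
    parser.add_argument("--data", default="fake_data.csv")
    args = parser.parse_args()
    print(analyze(pd.read_csv(args.data)))
```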
I've tried it through a terminal, and it says it can see the terminal but cannot execute things in it. So I think there are multiple steps. There's just writing code, getting help with writing the code, and then pushing it.
Then there's having it test its own code, which is where a fake data set locally really speeds up quality, because you don't have to test it and find the bugs yourself. And then there's actually having it run an agentic flow for you. But, not that I'm doing a huge amount myself, I'm more like you: I'm a bit of a control freak, and I want to see what's happening.
The reason I'm interacting with it is that I want to see the results, and I want to control it. So it also depends on your task. If you're trying to prototype a bunch of stuff, or do something really quick and one-off, that's where you want to give it control and let it go. If you're doing a complicated project with lots of people and collaboration, you tend to be a little more careful.
I think we are probably all micromanagers of our AI here. I have gone as far as having a Jupyter notebook, with code and graphs and so on, and I had it make an HTML UI where I checkboxed every line that I wanted to keep in my final output.
Taking a step back, broadly speaking, I think having these AI tools rewards the creativity that's already within you. The workflow you ultimately settle into is almost an extension of yourself and your own tastes and opinions, and it amplifies them.
And you, being Stanford students, are bright and intelligent; as grad students, you have a lot of time on your hands to experiment. So I personally think that even though there is a lot of fear and anxiety, you are perfectly positioned to take advantage of this moment. I really hope you do, and just beg for as many credits from your advisors as they will let you have.
Hallucination and Verification
Good. Let me move the conversation in a different direction and ask about hallucination and AI. A crowdsourced question asks how to prevent or discourage AI from hallucinating. Rose, I tried your suggestion: I asked my AI how to get it to stop hallucinating, and it suggested, why don't you put some rules about that in CLAUDE.md, which I did: no hallucination. It did not listen, and I called it out for failing to listen. I said, why are you not complying?
It said: I do not know why I'm not complying. I guess I should follow the rules, especially since I helped write them.
What practical advice do you have for those in the room who want to use AI and want to minimize the amount of hallucination it does?
Yeah, so for me, what has been very helpful is this reviewer workflow. It's funny: I was doing something with Claude Code and with Codex, then having a Claude agent review their work, and I quickly realized which of the two outputs was Claude's, with Codex being way more critical. I don't know if that was coincidence or not, but I feel like when you get these different stochastic processes arriving at a solution, they are really good at criticizing each other.
Of course, it's not always perfect, but it has happened to me a lot that Claude says, oh, I did this, and actually it's right, because of this, this, and this, in a way that's very subtle, and then GPT says, no, you're going too far; this is not the conclusion we should reach; let's take a step back and nuance that a bit more.
So for me, that's been super helpful. I also feel like they hallucinate way less when you give them a task that is very specific. If you let them go, be creative, and take control of your research process, they can start doing things you don't want them to do. But if the task is very clear, and you spell out all the things you don't want them to do, you've put the guardrails in beforehand.
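A minimal sketch of that cross-review loop, assuming the standard Anthropic and OpenAI Python SDKs; the model names are placeholders for whatever is current, and the panelists did this interactively in Claude Code and Codex rather than via a script like this:

```python
from anthropic import Anthropic
from openai import OpenAI

anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment

task = "Derive the variance of the difference-in-means estimator under simple randomization."

# One model drafts an answer...
draft = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": task}],
).content[0].text

# ...and a different model reviews it, with a narrow, specific brief,
# in line with the point that tightly specified tasks hallucinate less.
review = openai_client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{
        "role": "user",
        "content": f"Task: {task}\n\nProposed answer:\n{draft}\n\n"
                   "Check each step. Flag anything wrong or unsupported; do not rewrite.",
    }],
).choices[0].message.content

print(review)
```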
I was gonna ask Andy: did you experiment with cross-checking on the vote-by-mail extension? You had grad students cross-checking; what about cross-checking with other agents? How successful was that? Yeah, at the time, it wasn't. I had GPT, whatever the most recent was, I think it was 5.2 Pro at the time, look for Claude's mistakes, and it didn't find them. That might be quite different now. Or if you repeat the whole task multiple times... Yes.
Yeah, it didn't make the mistake every time that I retried it.
Yeah, and more generally on hallucinations, I think using the fanciest model is probably the most important thing to prevent them, and they're getting way better over time. And you can see some of the tricks they're putting in themselves: they've started putting in DOI numbers with hyperlinks and checking them, so you can see the planning agents checking their own work.
For some of the models, clearly there must have been some benchmark they wanted to win where hallucinating academic citations was a problem, because it got a lot better almost overnight, and the DOI numbers seemed to be the trick.
But there are other kinds of tests you can do, like specifically asking it to loop over your citations and check them, and giving it a way to check them. I'm not really having trouble with academic citations anymore. Same. Or links. That seemed partly an artifact of not being able to search the web. Yes.
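One mechanical version of "loop over your citations and check them" is to test whether each DOI actually resolves rather than taking the model's word for it. A sketch with made-up DOIs; note that resolution only shows the DOI exists, not that it matches the cited title and authors:

```python
import requests

# Hypothetical DOI strings extracted from a draft's bibliography.
dois = ["10.1257/aer.20181169", "10.9999/definitely.fake"]

for doi in dois:
    # Some publishers reject automated requests, so treat a failure as
    # "check by hand", not as proof the citation is fabricated.
    resp = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    print(doi, "->", "resolves" if resp.ok else f"failed ({resp.status_code})")
```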
But I do think some of these agents are trying to economize on resources, so sometimes you have to specifically ask them to do something that's expensive. That's just something to keep in mind: if you want it to search the web, you'd better tell it to search the web, because that's expensive, and it doesn't want to do it unless you make it.
I do say some things like: this is an extremely important task, it's critical for the project, so put high effort on it, and check your output twice or three times. I think it works; people say that it works. At 5.7, there's even an extra-high effort setting... Yeah. Someone said that you can put a backslash ten times... Yeah, "ultra," and it puts in more effort, but it's burning more tokens.
Right, but it's partly their internal thing. Actually, a former PhD student from here at the GSB is in charge of allocating GPUs at OpenAI, so this is a thing you might worry about in the future. But I think it's also that they're trying to maximize the user experience by not making you wait too long. So partly this is personalization toward you; you're saying, I'm willing to wait.
Of course, going fast saves them money, but they're also worried that making you wait for a perfect answer might make you sad.
I have just one general comment to make: I'm a little bit troubled. Three of the four questions so far have been ticky-tacky concerns about specific things with AI. These are all problems that humans have, too. Humans hallucinate all the time. They violate privacy; my wife's a privacy lawyer, and she can tell you lots of stories about humans violating privacy law.
I just hope some of the questions are going to be about all the important things we need to figure out how to do with AI, and not...
Big Picture Questions
the most stereotypical academic questions, like, oh gosh, we've got to make sure these minor problems along the way get solved. There are much larger things we need to worry about than disclosing our use of AI. It's an important question, but...
I hope in the last 15 minutes we'll get to something more like: we need to completely rethink how we do all of our research, or maybe academia is not going to exist anymore, instead of, oh gosh, what footnote should we put on the paper? That's my hot take.
Michael Tomz: Andy, do you think that in the future, academia is going to continue to exist? I think so, but we need to focus on the things that matter. What do you think? I think academia is going to continue to exist. I think we should focus on the things that matter. And it would be great to talk about the big-picture questions as well as the little ones.
Game Theory and Theory
Alright, here's another, maybe bigger-picture question. This is a panel about using AI for empirics, but there are a couple of game theorists on the panel. I'm curious how much, or whether, you trust AI for solving game-theoretic problems, and how you think it will change game theory, or theory more generally. This goes back to what Andy was saying about actually doing new things; it wasn't really great at that yet, but...
Presumably it's going to get even better. So, given that Andy was asking for bigger questions... what do you think?
Well, I'm in a group chat with a bunch of econ theorists about AI, Ben Golub, Annie Liang, and others, and they seem to think it's a huge deal. I'm not a theorist, though, so... I don't know if I was the other game theorist or not. But I think we should come back and ask different questions.
There's a strand where we're using them as experimental subjects to play games, and I think that's interesting, though that research agenda needs a bit more structure before it starts making really great progress. Then there's the proving-theorems part. But I guess I would love to see new questions.
And also insight, right? One of the problems with applied game theory is that things get intractable very fast, so we just stop when our heads explode, and we're left with slightly-too-simple models. It would be interesting if we could somehow harness LLMs to help us navigate through the messy complexity and get richer insights out the other side.
In general, theory wants elegant models so we can understand what's going on. But one of the reasons I stopped doing it is that one of the last really rich theory papers I wrote was just horrible algebra, and it wasn't exactly telling the story I wanted to tell; I couldn't get the model to tell the story.
So if it could help me with that, if it could help me build a richer, more realistic model and sort through all the possible assumptions until I found something that was both understandable and telling the story I wanted, that would be very nice. That's still going to take a lot of human ingenuity, so I don't think it's completely automatable. It doesn't fall in the category of a long proof that ChatGPT could write without me.
It's more about crafting a model that tells a story. And there are many complicated things in the world that are not modeled. When I teach my MBAs, there is no paper for most of the important things I want to teach them in applied game theory. That's a big gap.
Just two thoughts on theory a little more broadly. There are a few reasons I'm bullish about the importance of theory going forward. One is that a lot of the work we do, and certainly a lot of the empirical work, is very good at looking backwards, at trying to figure out what has happened in the past.
And I think the relative importance of figuring out what happened in the past, relative to what's happening in the future, changes as the world moves faster and faster.
There's a fundamental sense in which thinking about the future inherently involves extrapolation, and economics, like most of our disciplines, has not been very good at it. I'm not sure how we get really good at it, but models are a really important part of how we think about how things might change.
That's one thought about theory, and we know it matters: think about climate projections, which are fundamentally about modeling. The other thought is about how we do theory effectively.
I had a really formative experience in graduate school. Al Roth, who was at Harvard, where I was at the time, and Paul Milgrom, who was visiting for the year, taught a PhD class on market design, which I think was the first time they'd taught together, and one of the first classes on market design ever taught. Al talked about designing the medical match for residents; Paul talked about designing auctions.
At that point, I had been focused on micro theory and interested in theory.
And it was eye-opening for me how different the theory enterprise is when it's tethered to a very specific, real-world test. You're not just trying to understand what is true about the world broadly, or trying to prove elegant results.
You have to design a mechanism for figuring out which residents go to which hospitals, and you need to understand what different mechanisms will do, what their strategic properties are, and how gameable they are. We have some theorems, but the theorems don't quite apply to these cases.
And theory played a really important role in the actual, concrete design of both of those institutions. My bet is that that kind of work multiplies as we go forward.
The space of "maybe we could extend the model in this direction," or "this theorem assumed something was convex; what if we assumed something different, just for the heck of it, to see what happens?" I think the price of that kind of work is going to go way, way down. So it comes back to: what is the objective? Why are we doing this?
When theory is tethered to a concrete question and a measurable thing we're trying to do, I think AI and models helping us do it better is going to become a lot more powerful.
Maybe to pick up one last thing on this: I'm working right now with the World Bank, as the faculty advisor for the World Development Report 2026 on AI. A chunk of the people there are very careful empiricists who, like Matt, don't want to talk about the future. Yet every country in the world has to decide: what is my AI policy? What's my strategy? Do I build a data center? And so on. You have to answer.
And even though I'm not writing many theory papers myself now, I'm using theory so heavily in this. It's not a single theorem; it's pulling together and synthesizing many fields. To describe the growth impacts of AI on economies, you need to understand micro-adoption, management practices, growth and innovation models, reallocation in factor markets, the automation literature, the trade literature... wow. My head's exploding with all the theory I need to integrate into advice.
I'm finding AI incredibly helpful for summarizing these disparate literatures and pulling them together. I still have a lot of thinking to do, because nobody has written the synthesis. But it would have been so time-consuming for me to master all of these existing literatures myself, and the theory is just so important. If you're not using economic theory, you just spout nonsense about the future, like: there will be no white-collar jobs in 18 months.
And actually, that nonsense has political consequences.
It sways votes, it's freaking everybody out, and with a little theory and a few facts about the world, you can put them together and rule out certain outcomes. That's really important. A cool example people can check out, and that I'm a big fan of, is Alex Imas at Booth, who runs a Substack called Ghost of Electricity.
He just put out a long piece the other day, all theory, on how he thinks AI is going to play out in the economy and precisely why he thinks certain claims are crazy. It's a good example of how theory can play a very urgent role.
Human Subjects and Simulated Evidence
Alright, Andy, I'm going to throw out another question, one that I hope is big-picture enough to be worthy of your hot take. Okay. We've been talking about how AI can augment, or in some cases replace, what human researchers do. What about replacing or augmenting human subjects in social science research? Is that a big enough take for you? Do we need to study humans, or do we just need to study the output of AI?
I don't know yet; I'd be curious what other people think. I've been running a little quiet audit: there are about 20 startups all claiming to do this in different ways. One of them just had a pretty high-profile blow-up, I won't name them, where it was pretty clearly nonsense. I'm pretty skeptical of it.
I'm friends with a political scientist who's obsessed with this stuff, Yamil Velez. He's been testing these things for a while, and he's been very on the ball with AI. His take for a while has been that sometimes you can make the overall distribution look right, but when you start getting into all the different conditionals, it doesn't look right.
Yeah, it doesn't make sense to me. I don't believe it; my hot take is that it's all wrong. But I'm open, and I want to test it. I'm talking specifically about the study of political attitudes, or simulating political environments. I don't think today's tools are very good at it, but I haven't tested all of them, so I'm open to being wrong.
Can I layer onto that? I run a lot of experiments in my lab, and one fact about running experiments is that you often know almost everything except the treatment effect before you run the experiment. Afterwards, you come up with additional analyses, but for your primary analysis, you can figure out almost everything in advance.
So I have a rule that I've had since the 1990s: if you're going to do a survey or run an experiment, you should make a table of what your results are going to be. You know the standard errors already, you know the sample size, because you designed it; the only thing you don't know is the effect.
Most of the time, on your first draft of that table, you realize the thing you were going to run is uninterpretable and not what you wanted to do. But people never do it. I personally, and dozens of people I've worked with over many years, cannot put ourselves in the position of seeing the flaws in a design until we see a table. So my rule is: simulate it, make a fake table, and criticize it. But nobody wants to do that; everybody gets mad at me every time.
And they don't do a great job of it. When I say "simulate data," people also get very blocked on that: well, I don't know how to simulate it; what's the data-generating process? They get stuck.
So even though in principle everybody should know how to write down a data-generating process, and that should be part of grad school, because it helps you think about data, in practice people get stuck on it all the time. And if an LLM can simulate the data and test your code very fast, then you skip the step people get stuck on, and you think.
Now, that's not the only thing you should think about, but in practice it lowers the cost of testing an experimental design and accomplishes the goal I was trying to accomplish. And people may actually do it, because somehow people don't, even though I'm paying them and I asked very nicely.
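A minimal sketch of that fake-table discipline for a simple two-arm experiment; the data-generating process and effect size here are illustrative, not anyone's actual design. The standard error in the resulting table is essentially real, since it follows from the sample size and outcome variance you chose in the design; only the point estimate is made up:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2_000              # the sample size fixed by your design
assumed_effect = 0.2   # made-up treatment effect, just to fill in the table

treated = rng.integers(0, 2, size=n)
outcome = rng.standard_normal(n) + assumed_effect * treated  # illustrative DGP

fit = sm.OLS(outcome, sm.add_constant(treated)).fit(cov_type="HC1")

# This is the table to criticize *before* fielding anything.
print(pd.DataFrame(
    {"estimate": fit.params, "std_err": fit.bse},
    index=["constant", "treatment"],
).round(3))
```

If the resulting table turns out to be uninterpretable or unable to answer the question, you've learned that before spending anything on data collection.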
I know we're out of time, but I can't resist saying: we do that too, and we have in the past put those simulated tables and figures in the pre-analysis plan, to say, here's what we're planning to do. We had our pre-analysis plan in the appendix of a paper we were publishing in Science, and it said, in big, bold letters at the top of that section, that this is
simulated, hypothetical data, showing what the results will look like. And we had a referee come back and say: this study is completely unreliable; there's a whole other set of results the authors ran that they didn't discuss in the paper at all, with completely different results. So be a little careful, but...
Wrap-Up
All right, well, we should wrap up here, but thanks, everybody, for participating. This was amazing. Let me join in the thanks: I want to thank Rose for the inspiration to have this event in the first place. Thank you, Rose. And Guido, for your leadership in organizing it as well. Please join me in thanking our panelists for their terrific presentations and for engaging with these questions. I also really want to thank our terrific staff.
Their names all start with C and K, which we didn't plan in advance: Christopher Fraga, Karla Flores, Kali Zappalla, and Kate Green Tripp. Thanks to all of you for everything you did to make this event a success. I really appreciate it.
And to you, the attendees: thanks for coming, and I hope this is not the end but the beginning of an ongoing conversation about AI and the social sciences. We'll send you an email afterwards to solicit your advice about follow-up events, whether workshops, tutorials, or other formats that would be helpful to you.
As you heard from the panelists, this technology is changing by the day, so what we say today will need to be updated next week and next month, and we look forward to continuing to engage with you on all of that. Thanks very much for coming today.