
Ep. #7, Truth-Seeking Data Systems with Bryan Bischof
On episode 7 of Data Renegades, CL Kao and Dori Wilson sit down with Bryan Bischof to explore his journey from pure mathematics to building real-world ML systems. They dig into recommender systems, surprising data bugs, and why truth-seeking should guide data teams. The conversation also covers modern data tools, visualization limits, and the realities of “self-serve” analytics.
Bryan Bischof is the Head of AI at Theory Ventures and a seasoned data and machine learning leader. He has held senior roles at Weights & Biases, Hex, and Stitch Fix, where he built and scaled production ML systems. Bryan is known for his focus on truth-seeking data culture, practical tooling, and applied machine learning in real-world environments.
Transcript
Dori Wilson: Hi, I'm Dori Wilson and welcome to another episode of Data Renegades.
CL Kao: And I'm CL, CEO of Recce and your host of Data Renegades. Today, our guest is Bryan Bischof, currently Head of AI at Theory Ventures, who has been at the forefront of building products and teams, leveraging first, well, machine learning and now AI.
Bryan's career spans building the world's first recommendation system for coffee at Blue Bottle, leading deep ML work at Stitch Fix, building data teams at Weights & Biases, and most recently leading AI at Hex, where he shipped Hex Magic, their AI feature suite.
He holds a PhD in pure mathematics, co-authored an O'Reilly book on recommendation systems, and teaches data science at Rutgers University. Oh, and in his spare time, he walks every single street in Berkeley, where I live.
Hello, welcome to the podcast, Bryan.
Bryan Bischof: Hey, thanks for having me, CL and Dori. Great to be here.
CL: Yeah, we're such big fans of yours.
Dori: Yeah, we're super excited that you are here with us.
CL: Can you take us back to the beginning? What problem first pulled you into the data space?
Bryan: Yeah. So I was in graduate school and I was doing a math PhD at the time and I was planning on staying in academia as a mathematician. And some things were about to happen in my life that were kind of like a little bit of a distraction.
But before that I had met this researcher who had also been a pure mathematician in algebraic geometry. His name is Chris Hiller and he was doing this really interesting work where he was using machine learning and mathematics, like geometry, to think about neuroscience. And so he was doing this really interesting work in neuroscience and he was kind of at the intersection of machine learning and industry.
And I talked to him and he had all these interesting stories about problems he was working on. And I was like, oh, that sounds kind of cool. And then a short bit later, I met another algebraic geometer who had gone into industry named Kevin Lin. And Kevin Lin also had all these interesting things to say about data.
And I was like, huh. You know, I didn't think there were any interesting problems in industry, but maybe there are some interesting problems in industry. And so that got me thinking, like, maybe there's something there. And then I ended up applying to work at a startup after deciding that I wasn't going to stay in academia.
And I ended up working in time series data. And so my first foray into data science, I was the first data engineer at the startup, which was acquired by IBM shortly thereafter, where I was building data transformations on streaming data. And so at the time I had no idea about any of this stuff, but they basically said, okay, we've got this streaming data, they're time series. How can you help us make decisions on this time series data in real time?
And I was like, oh, okay, this seems like a very reasonable thing. In hindsight: we're talking streaming data, they're time series, and inference needs to be real time as well. So I was trying to do inference on real time streaming stuff.
And then the thing that I haven't even told you is the scale of the data. The scale of the data was also terabyte scale, and about 12 years ago, terabyte scale streaming data was a real pain in the ass. And so again, I'm completely ignorant and I'm just like, okie dokie, this sounds fun.
And so I still remember Spark Streaming came out when I was about like six months into this job and I was like, this seems relevant. I wonder if anyone else is using Spark for anything. And so yeah, that's when I started learning Spark. So that was my very first exposure to the data world.
CL: Wow.
Dori: And what language were you coding in?
Bryan: Yeah, want to guess?
Dori: R?
Bryan: Nope. Scala.
CL: Oh, okay. Yeah, I was actually just about to ask, coming from math, what sort of functional programming thinking would be natural to apply to all those kind of interesting industry problems. There we are.
Bryan: Yeah, yeah. I mean at the time, Spark Streaming did exist in Python and it kind of existed in R, but Scala was the one that people would talk about as like, oh, it's really good if you do it in Scala. And so I started learning a little bit of Scala and I was like, oh, this makes total sense. This is great. The type system was really great. It just felt really easy to think in terms of: you've got this RDD, how are we going to chunk it up into these pieces and operate on this RDD?
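His pipeline was in Scala, but the chunk-and-operate pattern he's describing translates directly. A minimal PySpark sketch, with invented sensor readings rather than anything from the actual IBM-era system:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# Toy RDD of (sensor_id, timestamp, value) time series readings.
readings = sc.parallelize([
    ("s1", 0, 1.2), ("s1", 1, 1.4), ("s2", 0, 0.9), ("s2", 1, 1.1),
])

# Chunk by key, then operate on each chunk: the functional map/reduce style.
means = (
    readings
    .map(lambda r: (r[0], (r[2], 1)))                      # key by sensor
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # running (sum, count)
    .mapValues(lambda s: s[0] / s[1])                      # per-sensor mean
)
print(means.collect())  # [('s1', 1.3), ('s2', 1.0)]
```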
It all just made sense with my math brain. And so to your point, CL, I do tend to like functional languages. They've obviously lost a little bit of steam lately, but I still do find them very natural. And I've learned a little bit of Haskell. I've learned, obviously, Scala. There was a short bit of time where there was, I forget what it's called, but there was this ML library that was functional. It was like F or something like that.
And I started learning a little bit of that because I was like, oh, PyTorch isn't functional enough. Actually, when people ask me my favorite machine learning or neural nets framework, I always tell them JAX. And the reason that I like JAX so much is because it's not really a functional language, but it sure as hell feels like a functional language.
When you write JAX, it just somehow just like fits right into my like, weird shaped brain. And I don't know, it makes a lot more sense to me than like trying to write things in like--
CL: Imperative ones.
Bryan: Yeah, exactly. It's why I've never found Keras very useful, and even TensorFlow is a little bit annoying. I just found them to be too imperative.
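The functional feel he's pointing at in JAX is concrete: the core transformations take pure functions and return new functions, composing like combinators. A minimal sketch:

```python
import jax
import jax.numpy as jnp

# A pure function: no hidden state, no mutation.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# grad and jit each take a function and return a new function.
grad_fn = jax.jit(jax.grad(loss))

w = jnp.zeros(3)
x = jnp.ones((4, 3))
y = jnp.ones(4)
print(grad_fn(w, x, y))  # gradient of the loss with respect to w
```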
CL: So you got into data at the startup that was acquired by IBM, and then what happened next?
Bryan: Yeah, I mean, so I built a lot of algorithms for taking this high speed streaming data and down sampling it and then doing inference. And then it was kind of time to move on. The IBM Borg process had taken place and the work had become useless.
CL: Not mysterious enough.
Bryan: Yeah, my time ended up just, like the value of my time was rapidly approaching zero. And so I started looking around, and of all things, at the time I was like, I'll look on Craigslist. I'll see if there's any jobs on Craigslist. Which, this is like a totally true story, but it's insane.
I had applied to some other jobs through normal channels, but I had seen there was a job on Craigslist at Blue Bottle Coffee and I was like, no way. I'm obsessed with coffee. Like I was already really, really into coffee. And the job wasn't even like software engineer, it was data software engineer. And I was like, this is exactly the kind of thing that I'm doing. Like I'm technically a software engineer at this startup acquired by IBM, but I focus on data engineering and data science work and I had written some papers on probability and stuff as well.
And so I was like, man, this seems like too good to be true. So I apply, I interview, and sure enough it is basically just come build a data practice. And so I built the data warehouse. I stood up all the sort of stuff around the data warehouse. And I remember during the interview process, especially close to the offer, they're like, if you can automate our P&L, hiring you will be a success.
We think it'll take a year. We expect it'll take time, but basically if you just automate our P&L, we will be successful in hiring you. That's like your success criteria. And I'm like, okay. And so it only took me about a month, and it was a hellish month, one of the grindiest months of my career, because I was trying to read all this information from NetSuite. And luckily I had this really amazing colleague Greg, and this other colleague Dave, and they worked on things kind of like on the back end of the system side.
And together we were able to build out this data warehouse and I was able to automate the P&L. And that was the beginning. And so from there it became like, how do we optimize everything? How do we think about optimizing the cafes? That led to me building a forecasting engine. So coming from a time series background, I was interested in forecasting, but I had never worked on time series forecasting proper.
And so I bought Hyndman's book and read the entire book and was like, okay. And I still have my copy, with all the notes in the margins and everything. I read the entire book, which is all in R, and I was like, okie dokie, time to just apply this. And so this is how I got into time series forecasting. And we were forecasting everything from the number of scones in the Rockefeller Cafe to where we should put the next cafe in downtown SF.
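He doesn't say which models Blue Bottle settled on; for flavor, here is the kind of Holt-Winters exponential smoothing Hyndman's book teaches, sketched with statsmodels on made-up daily scone counts:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical daily scone sales for one cafe, with weekly seasonality.
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", periods=120, freq="D")
sales = 40 + 10 * np.sin(2 * np.pi * np.arange(120) / 7) + rng.normal(0, 3, 120)
series = pd.Series(sales, index=days)

# Holt-Winters: additive trend plus weekly additive seasonality.
fit = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=7).fit()
print(fit.forecast(14).round(1))  # two weeks of scone demand
```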
We did a pricing experiment which was really interesting. We did an availability experiment. One of my friends, Sean Taylor, who's this really great data scientist, he works at OpenAI now, he and I have talked about this experiment, and it's one that he's been really excited about because it's like, I had to do a physical experiment. I had to do something in cafes to run this A/B test.
And so it was really fun to try to figure out how the hell are we going to do an availability experiment with items at a cafe. I can't simply allocate people at the door of, like, no you get a scone and you don't. Sorry.
And so it was a lot of fun trying to brainstorm how do you solve all these problems. And one of the problems was this recommendation system for coffee. It was basically just like, we had this subscription program and people wanted to get the coffee that they wanted. How on earth are you going to get them a good coffee?
And we had a pretty high level of churn, I would say, in terms of, like, they order their first coffee and then they do not stay subscribed. And one of the questions is like, can we reduce churn? And so we ran this really fun survey which was like, you know, we sent a survey to hundreds of subscribers about, like, a bunch of questions, and we took those answers to those questions, which are not necessarily about coffee.
And we looked at, like, okay, what coffee do they mostly order? And that was our training data for my random forest. And that was actually my first recommender. And so, yeah, that was super fun.
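A minimal sketch of that setup: survey answers as features, each subscriber's most-ordered coffee as the label. The columns here are invented for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical survey responses joined to each subscriber's most-ordered coffee.
surveys = pd.DataFrame({
    "drinks_black":    [1, 0, 1, 0, 1, 0],
    "morning_person":  [1, 1, 0, 0, 1, 0],
    "likes_dark_choc": [0, 1, 1, 0, 0, 1],
    "favorite_coffee": ["single_origin", "blend", "single_origin",
                        "blend", "single_origin", "blend"],
})

X = surveys.drop(columns="favorite_coffee")
y = surveys["favorite_coffee"]

# With real data you'd cross-validate; six rows just show the shape.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_.round(2))))
```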
Dori: Oh that's cool. What did you find was the biggest predictive feature?
Bryan: Yeah, your favorite salad dressing.
CL: Seriously?
Dori: Really?
Bryan: Yeah. Okay. So at the time, I was working with the coffee team, right? So Blue Bottle had this amazing coffee team. Like, of the top 100 coffee professionals in the world, a surprising number were all at Blue Bottle at the same time. Really the salad days of this kind of tech plus coffee. No pun intended, on salad dressing and salad days.
But at the time, I had this idea like, I want to build this coffee recommender. This is how I'm going to do it. I have all these great ideas. And I remember checking with the coffee team. This was not like, oh, I'm going to, like, work with the coffee team on this. I was just like, I should just check with them.
And I remember checking with them, and instantly they were like, "no, no, no, no, no. The way that you're approaching this is like completely wrong."
And I just remember two people in particular, Judith and Carly. They worked with me a lot on crafting questions everyone would be able to answer but that would have a really high likelihood of predicting what kind of coffee people wanted.
And so I just remember, it's really clear in my mind of Judith being like, so if you're a barista, you're really good at this game. You're really good at, someone comes into the cafe and they say, I want to buy coffee. And you ask them some questions. And she's like, these are some of the questions that you would ask in the cafe. But there's these other questions which we think are actually really interesting and useful that we don't normally ask in the cafe that we should.
And so Judith and Carly really helped me craft a reasonable survey, which seemed insane to me. And I was like, this seems completely like unpredictable. But it was surprisingly predictive. And so that was the first coffee predictor.
Dori: Oh, that's really cool. Also, I love that you took the qualitative knowledge that the coffee team had and made your data set stronger and more enriched. That's really cool.
Bryan: Totally. Yeah. I mean, it was one of those moments where I was like, going to Blue Bottle thinking I'm really into coffee. And then that was like, yeah, I realized that I was not as into coffee as I thought I was. And so I'm only more into coffee now. But it was a lot of learning and a lot of opportunity to work with actual professionals. And so that was great.
Dori: Yeah. I'm thinking, have you heard of a company called Trade Coffee?
Bryan: Mhm. Yeah. I remember when they first came out, I was like, hey, I've built this before. At one point, me and a friend who's also a Blue Bottle alumni, he and I were kicking around the idea of something very similar to trade. It was like, right around when they came out. But I think Trade's interesting--
Dori: Wait, real quick for the listeners. Trade Coffee is a subscription coffee service where they ask you questions and then tailor the types of coffee you want and send you coffee beans.
Bryan: Exactly.
Dori: All right, keep going.
Bryan: Yeah. Coffee shops are famously low margin, and coffee subscriptions are also famously low margin. And so there are some reasons to not deeply invest in this area. But one of my retirement goals is to start a coffee club.
Dori: Yeah, it's cool. I was just thinking about that when you were saying salad dressings. I was like, Trade didn't ask me that question, didn't ask me any off-the-wall stuff.
Bryan: Yeah, they need better data scientists. Haha.
CL: So you built this recommendation system, and it's combining this unintuitive input from the team, and it has to impact real world customers. Right? So I'm curious, in your career, what's the most painful bug or failure you've seen in production?
Bryan: So later on, I went to Stitch Fix, which is a clothing company. They send you a box of clothes, and you pick some of those clothes that you keep and you send the rest back. And it's a really great mechanism for people that are not super excited to go to stores and shop themselves, or they want to just discover things that they maybe wouldn't be inspired to on their own.
So Stitch Fix had a really great Algorithms team; that's what we called the data scientists. And I worked on a bunch of different projects there. One of the things I worked on at Stitch Fix was sort of a Clueless Closet type thing. So you talk about your closet and you talk about clothing you own, and you can pin stuff and you can sort of say, like, oh, these are some things I found on the Internet, some images. And then we can recommend things that are in our inventory that might be appropriate for you.
You may have seen this, Steal This Look meme. I remember one of them that was really popular for a while was, like, the Bernie Sanders one. It was like Bernie Sanders sitting with his hands crossed and it says, Steal His Look. And on the side it shows you can get his jacket here and his pants here and his beanie here. But some of them are more ridiculous.
Like, there's ones where it shows a bird and it says "Steal Its Look" and it tries to show you all the clothes that would make you look most like this bird. Well, anyway, we built Steal This Look way back in, like, 2020. And it worked. It was basically like you would upload a photo and we would tell you all the pieces of clothing in that photo that were most similar to things that we had in our inventory in your size and appropriate for your style.
Because if you remember, Stitch Fix was really good at latent modeling of what people liked. That was actually one of the reasons our recommendation systems were so powerful, because we had built a lot of things around latent style over the years. So I used these tools to build Steal That Look with one of my colleagues, Ian.
And so what was interesting was like, all the metrics were quite good. And you know, the offline metrics before launch were inspiring, we'll say. And then we launched it and it was interesting. We definitely like had decent results, but we almost immediately identified that there was a problem. And the PM that was working with us at the time, Lila, she was like, why does this recommend so many backpacks?
I'm like, what are you talking about? And so she pulls up hers and sure enough it's like, backpack, next picture, backpack, next picture, backpack. There's no backpacks in the pictures. And I'm like, what the hell is going on? Why are there so many backpacks? That was already a little bit of a red flag, because there's a backpack in every recommendation.
What was weirder was, in some, there were multiple backpacks. And I'm like, I don't think this is even possible. I designed the diversity metric. I know that there's two kinds of diversity programmed into this recommender. There's soft diversity, which is basically based on mutual distance in the latent space. And then there's hard diversity, which says you can't recommend two items from the same category. Category is like a special field that we had.
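Those two rules make a compact greedy re-ranker. A minimal sketch with hypothetical inputs (candidates already sorted by relevance, embeddings in some latent space):

```python
import numpy as np

def rerank(candidates, k, min_dist):
    """Greedy re-rank enforcing both diversity rules.

    candidates: list of (item_id, category, embedding), sorted by relevance.
    Hard diversity: never pick two items from the same category.
    Soft diversity: skip items within min_dist of anything already picked.
    """
    picked, used_categories = [], set()
    for item_id, category, emb in candidates:
        if category in used_categories:                  # hard constraint
            continue
        if any(np.linalg.norm(emb - e) < min_dist for _, _, e in picked):
            continue                                     # soft constraint
        picked.append((item_id, category, emb))
        used_categories.add(category)
        if len(picked) == k:
            break
    return [item_id for item_id, _, _ in picked]
```

Note that the hard check assumes each item has exactly one category, an assumption that matters in what follows.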
How is this possible? I'm genuinely baffled that it's giving two-- Like, backpacks at all, weird. Two backpacks in the same one? Something is wrong. I shit you not, I found one where it gave three backpacks. And I'm trying to create memes to share in the internal channel of me wearing three backpacks. I'm like, this is bizarre.
So man, this was like at least a day of just staring very, very hard at all the systems and being like, I don't know how this is possible.
CL: Mhm.
Bryan: Any guesses as to what was wrong, before I tell you the answer?
Dori: Was the backpack data itself marked wrong?
Bryan: Kinda. So backpacks were one of the very rare things that were allowed to have multiple categories associated with them. Why? Unclear. But it was one of the very few garment types that we would assign to multiple categories. So one, backpacks got multiple shots on goal. Okay? Two, and this is the thing that was even more annoying, we're doing all this in a latent space for recommendations, right?
We're trying to learn in a non-explainable way what is going to be a good recommendation. We're going to minimize some distance in this high-dimensional space. Well, as you probably know, high-dimensional space can be a little bit odd. And so some of the high-dimensional geometry means that things that should be close aren't as close as you'd like. And things that aren't that close can be shockingly close.
We weren't using cosine distance. There were some reasons why we weren't using cosine distance. We were using more of, if I remember correctly, an L2 norm for this. There were some specific reasons. Not for everything, but for one part of this recommender, we were using an L2 norm. Well, you expect most things to be on a shell, and if the radius is big enough, L2 distance on the shell is close enough to cosine distance that who cares?
Everything is happening on the shell. In high dimensions, everything's on the shell anyway. You know what's not on the shell? Backpacks aren't on the shell. Backpacks are in some weird part of the latent space where nothing else is except for backpacks. And that part of the latent space is basically close to everything.
CL: Wow.
Dori: Is it because the categories? That's fascinating.
Bryan: And so it had two problems: one, it was in this multi-category thing, and two, in the latent space it was just close to everything. And so I'm trying to figure out how this is possible. One, why is it so highly recommended? And two, why is it even giving you multiple responses?
And so, long story short, I hadn't looked carefully enough at every single type of data and what categories it could be in. Like, there were only two or three items that could even be in multiple categories. And I certainly didn't expect any of the items to be close to everything. But what's interesting is, since then I've seen three or four examples of stuff that is close to everything by being so weird, by being so out of distribution.
You think of like out of distribution as everything's over here and it's over here. One of the things I started to learn about like high-dimensional latent spaces is that "out of distribution" sometimes means weirdly close to everything. And one of the ways that this has bitten me more than anything is not even that it's close to everything, but it's close to the null vector.
CL: Wow.
Bryan: Things that are close to the null vector have now bitten me twice since then. Recently I was working on this project called Semantic.Art. And to be very blunt, we had this situation where, more often than I was happy about, we would get NSFW content. And I'm like, why is this NSFW content? And it's not even a bunch of random NSFW content. It's the same NSFW content. It's the same creepy photos that are like--
And for those of you that don't know, Semantic.Art is a search engine that I built with Ayush and Chung from LanceDB to search across a bunch of different art. And you can kind of put in anything you want. You can put in crazy shit and it'll still find good results. Well, with an asterisk. And what was weird is, like, there's not a lot of NSFW content in this data set. What's in there?
Well, there's some slightly creepy photographs that an artist took of some naked people. And there's like, this other one. It's like a painting of some naked people. And some of them are more and less creepy. And I don't have to get into all the details, but there's this set of maybe 50 images. And I'm like, why is this coming up so often?
And finally Ayush actually found the answer, which is: sometimes we were erroring, and that was returning the null vector. And all this NSFW content, by being way out of distribution, was actually close to the null vector. And so as much as I wish I could say, oh, I had the backpacks bug and never again was hoist by this petard: wrong. It's happened multiple times since. It's been a recurring nightmare.
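The geometry is easy to demo. In high dimensions, random unit vectors are nearly orthogonal, so a typical on-shell pair sits at distance about √2, while the null vector sits at distance exactly 1 from every unit vector: closer to everything than everything is to each other. A quick numpy sketch with made-up embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 512, 1000

# Embeddings "on the shell": random directions, normalized to unit length.
embs = rng.normal(size=(n, dim))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

# Typical distance between two on-shell points: ~sqrt(2), about 1.414.
pair_dist = np.linalg.norm(embs[:500] - embs[500:], axis=1)
print("on-shell pair distance:", pair_dist.mean().round(3))

# The null vector, e.g. what an erroring embedding call might return.
dist_to_null = np.linalg.norm(embs, axis=1)  # distance from each point to zero
print("distance to null vector:", dist_to_null.mean().round(3))  # 1.0
```

Under an L2 nearest-neighbor search, that zero (or near-zero) vector beats almost every legitimate neighbor, which is how the same weird items keep surfacing.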
So anyway, that's my worst bug in production. Too many backpacks. I do have too many backpacks in my life as well, but that's an unrelated bug. My wife is very annoyed by that.
CL: This is fascinating. Does it have something to do with you moving a little bit upstream with tools? Like, hey, this is really hard to catch, can we do better experiment tracking? Is that what drew you there?
Bryan: Very astute observation. I was using Weights & Biases during that project, and I did see an early demo of a product from Weights & Biases called Tables. And Tables--
Okay. Try to imagine a world where you have a spreadsheet, and in some of the cells of that spreadsheet, you have pictures.
CL: Yep.
Bryan: Crazy idea. At the time, this didn't exist and it was even more difficult to do anything related to like multimodal stuff and be able to work with like your different experiments. Experiment tracking for this project was like, not trivial. And so, believe it or not, yeah, I was a user of Weights & Biases and that was very much like how I got connected with them and got excited about that product.
And much to your kind of allusion there, I basically joined W&B because I was like a fan of the product. I interviewed there to come and basically be an MLE because I was like, oh, I think I could build a lot of stuff on top of Weights & Biases as an MLE. And after chatting I was like, actually the most valuable thing they need right now is a head of data. And so I joined as the first Head of Data and hired three teams to basically build out the internal data product.
Dori: What made you realize that they needed a Head of Data and not just another powerhouse IC?
Bryan: I think when I asked them some basic questions about things in the business, there seemed to be a gap there. And I don't want to take credit for being the person who noticed this. The CEO at the time, he very much was like, I don't want to say "telling" me, but kind of giving me a hint that there was some opportunity for more internal data science and data analytics. And that led to us just talking about it openly.
And so I think he saw pretty clearly that on the operational side there was opportunity. And he was totally right. They had a consultant at the time in the data engineering side just doing some structural data work. And that was a nice way to bootstrap my organization. But there was a lot of work to do.
Dori: Yeah. What was the first few hires that you knew you needed once you were Head of Data?
Bryan: Yeah, the first two hires that I made, or maybe it was even three, were basically focused on: I knew I needed a data engineer, like a full time data engineer, and I knew I needed a full time data scientist who was going to be focused on analytics. And so those were some of the early hires.
I think if I remember correctly, I hired one data analyst, one data engineer, and one ML person. And they were there for a while and I think two of them are still there actually. And so, yeah, I think it was really, really critical to kind of get those key hires in place. We had a lot of basic work to do. Everything from are people using the product to do we have any deals in pipeline?
There was like a six month period where all of my team was focused on getting Salesforce data into the data warehouse. And I wish I could say it was a very unique problem, but from every data leader I've ever spoken to, this is one of the core problems. Which is surprising: you would think this would be stale news by now, but actually Salesforce continues to lock down their data more and more and make it harder and harder to use with anything else.
And so, yeah, it's surprisingly pertinent to our everyday reality these days.
Dori: Yeah, it's so difficult. I've had to deal with Salesforce data, and then trying to pipe data back into it is a nightmare. I mean it's just awful. And maybe, you know, you saw this some at Hex, because I've seen Hex going more and more into the go-to-market side. And it's like, yes, it's just much easier for me to pull the salespeople and the customer success people to me and this wrapper on a Jupyter notebook that they can just use, than to try and keep piping in and out of Salesforce constantly.
Bryan: Yeah, I mean--
From my seat, I see so much value in being able to get your hands on this data and start screwing around with it and asking questions about it.
Like now we have this new terminology, GTM engineer. And GTM engineer has become quite fashionable. And I think it's a good thing. I think it makes a ton of sense. But what was a GTM engineer before Clay decided to make a name for it? Well, that was a data engineer. Building internal tools around data that comes from Salesforce, building a data product for that: that was just the charter of the data team.
And so obviously you've got this interesting transition of just engineers are doing this and then we have specific data engineers that are dedicated to this kind of stuff. And then now there's a GTM engineer and now you're starting to hear AI engineers are starting to become responsible for doing this work.
Again, this is great. This makes a lot of sense. I think it's exciting, but it is interesting. It's a lot of the same work, just new tools, like, what are the tools that kind of like enable this? And so right now I think that's the trend that I'm seeing in terms of who's doing this work.
CL: Wow, this is fascinating, because you've seen this zero-to-proper-data-system movie a couple of times, right? And in various different contexts. But if you had to rebuild one piece of the entire stack from scratch, what would that be?
Bryan: Okay, so I'll say a couple things. Like, if you had asked me this maybe three years ago, I wouldn't have paused. I would have said the query engine. Like, I would have just, like, blurted out the query engine. And the reason is because, like, where do you do your queries? When I was at Weights & Biases, we were a BigQuery customer. Have you ever used a BigQuery UI?
CL: Yeah.
Bryan: It feels like someone designed it to punish me. I don't know when we met, I don't know what crime I committed in front of them, but they were like, I'm going to make it my job to make Bryan sad. And then, I feel like the Snowflake version, they're like, oh, that's cute. Like, are we playing a game?
It's like those JavaScript competitions, like, who can design the most painful calendar UI. I feel like the Snowflake team was like, oh, we're competing for worse query engine? We got you, like, hold our database query.
Dori: Hold our beer.
Bryan: Yeah. And so those are just offenses on the world. What's interesting is before I was a BigQuery customer, I was at Stitch Fix. And at Stitch Fix, we used DataGrip connected to Presto. And I remember thinking, like, man, Presto is such a nightmare to get working well. But thank God for our algorithms engineering team. Like, our platform team has made this work. And now I can use DataGrip, and DataGrip is kind of okay. It's not the most amazing thing in the world. It's just kind of okay.
So if you had asked me three years ago this question, I'm like, the query engine, they suck. It's so painful. But now I have two things, and both of these things are, like, they're somewhere in the camp of incredible to it's still hard for me to believe. And those two things are Hex and MotherDuck, and they're really, really incredible in kind of almost orthogonal directions.
And it's just a little bit surprising to me, if you had asked me three years ago, does this need to be one thing or two things? I would have said, I don't even know what the two would be. I was wrong. MotherDuck, I'm exploring the data with SQL in a way that feels so fast it boggles the mind. I don't understand how they make it so fast. And just doing quick things in MotherDuck, I'm like, holy shit, this is so great.
On the flip side, there's building up a really complicated data pull. There are other things that Hex is good for, but when I'm trying to build up a really complicated data pull, like, I write a SQL query, I get a data frame, I write another SQL query on that data frame, and it compiles that into a single query and I didn't have to think about it? Holy shit.
So like that is amazing. That's actually one of the reasons why I started getting interested in Hex even before I worked there. So those two things have like solved that question. And then of course now you layer on AI and things get really exciting.
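Hex's exact compilation of chained SQL cells aside, you can get a feel for the SQL-on-dataframes pattern with DuckDB, the engine under MotherDuck, which resolves dataframes by variable name. A rough analogue:

```python
import duckdb
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "c"],
    "amount": [10.0, 15.0, 7.5, 22.0],
})

# Query a dataframe directly; DuckDB finds `orders` in local scope.
totals = duckdb.sql(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
).df()

# Then query the previous result the same way, no glue code in between.
top = duckdb.sql("SELECT * FROM totals ORDER BY total DESC LIMIT 2").df()
print(top)
```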
Dori: Don't forget the dynamic variables of Hex. That was like, for me, one of the best things of like, okay, I don't need to figure out how to load this. I'm just going to write this little query. Got my data frame, I'll do my Pandas. Great, I can make this dynamic, make this an app. And now, boom, it's gone. My RevOps guy is happy. Such a big unlock.
Bryan: Certainly. There's a lot to love about Hex, I think. But even at this simplest layer of the query engine, I'm like, man, that was the thing that was really making me sad as a data scientist. I just remember constantly commenting out some lines of SQL, rerunning it. Fuck. If I comment these lines out, I have to change this other part of the SQL statement. Okay, I rerun it. Okay. Like, so annoying.
And honestly in Hex I never had to do that. And that was like magical. And similarly for some of the like very rapid heuristics from MotherDuck, I'm just like, oh my God, it's so good. So that's what my answer used to be. I know this is a long preamble.
I think today, the thing I don't really have is a good way to chart easily. I want to make charts, I want them to be according to my style and branding, I want them to be in a very particular form. I want them to be print-worthy. I want basically unlimited flexibility. I don't want to drag and drop anything. And so I just built my own library for this because I was so frustrated. Nothing is very good in this space.
The chart builder in Omni for certain kinds of things for dashboarding, great. The chart builder in Hex for other things, also really, really great. I still think in Hex charts a fair bit, but when it comes time to, like, I want to make a chart for my blog, there's nothing. And so I built my own.
And so now I can write natural language, I can give it a data set, and I can get a chart that matches the Theory branding and uses the correct fonts and the right line widths and the right color palette. And I can do anything I want. Like, if I want to commit some chart crimes, I can do that.
Dori: No pie charts.
Bryan: No, no. I mean, geez, I'm not a sociopath.
Dori: That was the worst chart crime I can think of.
Bryan: Yeah. Recently I wanted to make this, like, step leaderboard chart. It's like a dotted line chart where it steps because you want to see, like, progress over time. But I had to do like, 10 of them on the same chart. It's like chart junk to put them all on top of one another.
And I was like, "well, I know the rule. Never use more than ten small multiples. Oh, well!"
So I had ten small multiples. And how did I make it not look like shit? I don't know. You can judge if it looks like shit, but I just abused my chart library to make it look right. And so I think these are the things that, like, right now I still feel like this is really tedious. I think another thing is--
We have gotten to a world where there are more agents for data than there are flavors of SQL, and that's saying something. And yet I still don't feel like there's a data agent that really fits into my workflow organically.
I recently hosted this hackathon because there's not something that allows me to look at, like, data spread across multiple modalities and still allow me to do data work. I can do this data work with Cursor, but it's really painful for certain reasons.
And so one of my motivations for building this ridiculous data set is I have a lot of tasks where I'm like, "here's a bunch of PDFs, here's a bunch of log files, here's some structured data. Let me just do great data work on top of this."
And right now, I claim, and I've yet to be proven wrong, there's nothing great for this. And so that's something that I'm really eager to see. So that was why I built the data set that I built.
CL: Yeah, that hackathon was amazing.
Bryan: Thanks. For people that maybe don't know, I hosted a hackathon about a month ago where I constructed a data set of that type and then 65 questions about the data. And they're all hard, real data science questions that, like, as I've joked a couple times, they were inspired by painful questions I've gotten over the years and have ruined my Friday. And so I tried to harness all that pain into the questions for this dataset.
Dori: Yeah. When you're talking, though, about rebuilding the stack, I was hearing two different things. There's the visualization, kind of the data-story element, and then there's being able to work functionally across mixed sets of data comfortably, without having to code it all out just to structure it.
Bryan: Totally.
Dori: So I'm a Plotly fangirl, so I'm curious. Because with Plotly, you can do the layouts.
Bryan: Totally.
Dori: What was wrong with Plotly for your needs?
Bryan: Okay, so Plotly. I used Plotly pretty heavily for about a year when I was at Blue Bottle. And I was using Plotly in a cafe one time. Not one of our cafes, a different cafe. And I look over and the person next to me is also using Plotly. And I was like, oh, that's cute. And then I'm like, that's interesting. He's not using Plotly. He's building Plotly.
And I was just like, are you an employee of Plotly? And he was like, yes. And I was like, so amazing. I was like, I was just making a Plotly! Anyway, long story short, I think Plotly got a lot of things right. And again, in the age of LLMs, I think Plotly is different. Plotly's API has always been, like, again, adversarial. Similar to the query engines, where I'm sort of like, "oh, I wonder why they don't like data scientists and why they want us to suffer?"
And so, despite Plotly having a lot of the capabilities I want, I've always found the APIs to be baffling. Like, I could just never guess how they behaved. Also--
Ask yourself: how many times have you sacrificed your creativity and your inspiration because of the tools that you're using? I do this all the time. Like, every time I use a piece of software, I'm making sacrifices of my creativity.
But in charting in particular--
Dori: I've created fake charts in Figma. And like, I've used shapes to create fake charts because I couldn't really get what I wanted.
Bryan: Exactly. I feel like more acutely than almost any other type of work that I've ever done, that's been the area where I'm like, I can't get what's in my brain into this API. And what I would say is like, having just built a charting library internally, again, just basically for myself, maybe like one other user, like, wow, you could do a shitload of stuff with Matplotlib if you let Cursor go wild.
I mean, I've made some charts that I've wanted to make in the past. Like, let me give you an example. When I went to Weights & Biases, we built this flux chart. And in case the audience is not familiar, a flux chart is, you have a time period on the X axis. And so let's say it's a weekly flux chart or a monthly flux chart, and every tick is a time period. And on the Y axis you have a stacked bar chart that can go up and down.
Above the zero line is the good stuff. This is the number of new users; the number of users that we previously thought were churned but have come back, which is called resurrected; and the number of retained users, meaning users that we had before that have continued to be active. What's below the zero line? Customers that you've churned, or in some cases you can put some other things down there.
Building a flux chart is extremely laborious in most languages or charting tools. It's really hard in Hex. Maybe still impossible. At one point I think I was able to really hack and get one to work, but it's really close to impossible, if it's not impossible. Nick Kruchten can yell at me later if I'm wrong. He's the charts guru at Hex.
In Mode, this was like a Sisyphean task. I think I eventually got something close. Surprisingly, I was able to get this to work in Mixpanel. That was not totally a disaster. But anyway, this chart type: huge pain in the ass, but you can get it working and it's really valuable.
What's even more valuable is to be able to take an extract from that flux chart and make it a stacked bar chart on the same axis down below. Small multiples, where you have a small version of certain things that you want to track, with the flux chart as the top one. You know what's even nicer? Adding annotations.
You know what's nicer still? Being able to add a different coloring to a subset of the time domain, and within that time domain, different slices of the stacked chart. Dear reader, what if I told you you could also change the background color of the axis so that you can see which quarter is which? You don't have to stop there. The sky is your limit.
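At its core, a flux chart is just two stacked bar calls sharing a zero line, one stacking up and one going down. A minimal Matplotlib sketch with invented weekly numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

weeks = np.arange(8)
retained    = np.array([100, 110, 118, 125, 130, 140, 148, 160])
resurrected = np.array([5, 6, 4, 7, 6, 8, 7, 9])
new         = np.array([30, 28, 35, 31, 40, 38, 45, 50])
churned     = np.array([-12, -9, -15, -10, -11, -14, -9, -13])  # negative: below the line

fig, ax = plt.subplots()
ax.bar(weeks, retained, label="retained")
ax.bar(weeks, resurrected, bottom=retained, label="resurrected")
ax.bar(weeks, new, bottom=retained + resurrected, label="new")
ax.bar(weeks, churned, label="churned")  # negative heights draw downward
ax.axhline(0, color="black", linewidth=0.8)
ax.set_xlabel("week")
ax.set_ylabel("users")
ax.legend()
plt.show()
```

The annotations, per-quarter background shading, and small multiples he describes layer on top with `ax.annotate`, `ax.axvspan`, and extra subplot rows.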
When you are working with something like Cursor, this chart can get more and more and more rich. I find myself thinking that I can open Tufte's book and say, no longer is this reserved for artists. Now I too can make data visualizations that actually harness all the information that I want.
For one of our quarterlies, our QBR recently, I made this chart that I'm talking about, not with customer data but with a different data set underlying it. And I'm like, A, there's no way in hell I could do this in any library. B, the amount of time that this would have taken to even get close? Insane.
And I only spent about three hours making this chart and it's everything I want. It's perfect. It's in our fonts, it's in our colors, it uses the right line style. I'm able to go as crazy and hog wild as I want.
Oh, the annotation looks kind of weird? Move it over a little bit. Make the arrow more wiggly. Who the fuck cares? Cursor is just like, yes sir. And so I think this is my point. I think we often find ourselves assuming our tools are good. But I have become much more ambitious in a lot of my work because I'm now exposed to this superpower.
Dori: We're self-limiting in ways we don't even know. Because, for people that are listening, three hours to make a chart like Bryan's talking about is not that much. I'm sure you're like, what do you mean? Three hours on a chart is mind blowing. But to do what he's done, if you were going to try it, you'd have to restructure your data in a bunch of weird fucking ways.
You would probably give up at a certain point and just start doing what I was doing: either working with shapes, or somehow making things transparent to try and get different hidden effects, instead of actually having it. And then annotations: most of the time, if it's in a deck, I would just give up and create, you know, "here's a text box, I'll just make it look like the right color and everything."
And that was just very time consuming. So I'm just trying to flesh it out a little for people that have not felt this struggle.
Bryan: You call out something that is really underappreciated about chart libraries that are good. Chart libraries that are good allow you to do things like aggregations, and they allow you to handle mixed-type underlying data. Because ultimately you come to the chart library with your data in a normal form. It's in some structure that it should be in.
And then you are often like, in the Tableau days, I used to pull out my jackhammer and be like, okay, I'm going to now beat the shit out of this data to get it into the exact form it needs to be in so that Tableau will allow me to make this chart type. Some of the modern BI tools like Hex and Omni are pretty good about saying, okay, we'll let you do aggregations in the charting library, and make this kind of easy and powerful.
That's great. I'm really excited about that trend. But let me tell you, Cursor will do anything for you. Like, you want to aggregate in this, like, completely bizarre way to support two chart types simultaneously? Okay, who cares. Like, it doesn't have any qualms about this.
And so that's where I'm just like, yeah, my mind has really been transformed. And this is just a teeny, tiny little effort. I mean, I spent two days on this library. This is not the last three quarters of engineering time. This is two days of: I was trying to write the blog post about America's next modeler, I was pissed off, my charts looked like shit, and I was like, all right, let's go. So, yeah, now I can turn anything into a Theory chart.
CL: Wow. Well, we'll definitely need to link to the charts you created. I've had my fair share of working with charting libraries, and listening to what you talked about, it almost feels like previously the charting library needed to create an abstraction, but that abstraction limited what we wanted to do. Now, with all these AI agents or coding agents, it's almost like we just need a bare minimal abstraction, maybe D3 or something very primitive, so you can customize everything, right?
And then with natural language, say, okay, now translate that into the representation of transformation and charting that we need.
Bryan: Yeah, I started off by just telling Cursor, like, hey, I want to build a charting library that is mostly built on top of Matplotlib, and it takes YAML. And I want to be able to specify in YAML roughly what the chart information is. And then using that YAML, I want you to translate to the different chart types. And when talking to Cursor, I'm like, we're going to need to build sort of a translation layer from the YAML to a bunch of different Matplotlib functions.
Build out a simple skeleton. Here's an example plot that I want to be able to build. And so that's where I start. And then it's not even that I want to go and define the YAML by hand. No manual YAML, if you want to be rhyming. What it is, is I'm actually going to make the agent do the YAML too. And when it has all the structure, the agent is surprisingly good at just grinding this out.
And because it's YAML, it's easy for me to change things. And so I'm like, oh, I want to change the chart title? I just go do that. I don't have to look anywhere; it's in a spec. It's so fast now for me to take a chart and make a new one with different data, changing a couple of things, or just tell the agent, hey, make me a new YAML, this is the thing that I want. Oh, we don't support that chart type? We do now. Get to work.
And so that's what I'm saying. This crazy flux chart that I made? Technically that's now a supported chart type in my library. I don't know how many times I'm going to need this crazy chart type, but I got it, you know. And at one point I'm like, you know what, let's just leave off the bottom part. I want to see what it looks like without the bottom part.
I just remove part of the YAML. Okay. It's all so trivial. And so it's just very straightforward to build very ergonomic DSLs, ergonomic for me and for the Cursor agent later on.
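A toy version of that YAML-to-Matplotlib translation layer, just to make the shape concrete; the spec format and renderer table are invented, and his internal library surely does far more:

```python
import yaml
import matplotlib.pyplot as plt

SPEC = """
title: Weekly active users
kind: line
x: [1, 2, 3, 4]
y: [10, 14, 13, 18]
"""

# Each supported chart kind maps to a small Matplotlib routine.
RENDERERS = {
    "line": lambda ax, s: ax.plot(s["x"], s["y"]),
    "bar": lambda ax, s: ax.bar(s["x"], s["y"]),
}

def render(spec_text):
    spec = yaml.safe_load(spec_text)
    fig, ax = plt.subplots()
    RENDERERS[spec["kind"]](ax, spec)  # dispatch on the declared kind
    ax.set_title(spec["title"])
    return fig

render(SPEC)
plt.show()
```

Adding a chart type means adding one renderer entry, which is why "we do now, get to work" is a one-commit conversation.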
Dori: Yeah. We had an earlier conversation, I think it was with Roger, where he was talking about using AI agents to help you create deterministic processes. Right? So instead of just letting it go wild, it's like: I've made you the YAML, you will make this deterministic and then work within it, versus just letting it totally have free rein.
Bryan: Absolutely.
Dori: Like forcing it to create structure and then work in the structure.
Bryan: Absolutely. Right now we're doing, well, calling it a migration is a little bit of an oversimplification, but for the sake of argument we'll call it a migration internally. We've got a ton of unstructured data. And the thing that generated this unstructured data has gone through multiple evolutions.
There was the period before I got here. There was the period shortly after I got here. There was the period a little bit later after I got here. And there was the most recent version where we went and enhanced the data generating process. All of those were non-trivial time periods. They're everything from like a year and a half to a month. So all of that data is important, but the structure that comes out of it is wildly unevenly distributed.
The most recent version is much more predictably structured, like I was trying to make it structured and good. Some of the previous ones were the right shape at the right time, but looking back, they're just okay. So why am I telling you this? Because right now we are migrating all this data into a new backend. And we want the backend to be structured. There are some specific reasons why, but one of the reasons is we want to be able to search this data.
Another reason is because I want to build another data product on top of this. And by having the structured version of this, the data product that we can build will be a lot more useful. But remember what I just said. The most recent version has a great structure that we planned. The previous version is semi-structured. The version before that has a structure. And the one before that, no one's calling that structured. It's just notes in a doc.
Well, that's okay, because I can basically point an LLM, as a MapReduce, over all of these documents and say: extract this. If you see this exact structure, just use your regex engine. If you see this weird structure, use your regex engine on these parts and use your LLM brain on these parts. If you don't see any structure, have fun. Do your best, extract what you can. And so what's crazy is it's not even like I use an LLM to crunch a bunch of unstructured data into a structured form.
Nope. I'm literally using an LLM to do several different types of things simultaneously. On the parts that are structured, I'm doing structured shit. On the parts that aren't, I'm not. And I'm even getting visibility into exactly what's happening. I use the MapReduce terminology not because it is a true MapReduce, but because it is a very similar paradigm, where some things I'm just mapping to the next stage, in some cases I need to reduce, and I even do some splitting.
And so, as I talked about, there are some of those formats where it's semi-structured. We're going to send that off to just this piece, and this other one we're going to process with LLMs, and then we're going to reduce them later in this kind of DAG or workflow builder. Where does this exist? It doesn't, but you can build it in a day.
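A sketch of that dispatch-by-structure idea. The era-detecting regexes and the llm_extract helper are stand-ins, not his actual pipeline; the point is routing each document to the cheapest extractor that can handle it:

```python
import re

# Hypothetical markers for two of the structured eras.
V3_PATTERN = re.compile(r"^status:\s*(?P<status>\w+)", re.MULTILINE)
V2_PATTERN = re.compile(r"\[status=(?P<status>\w+)\]")

def llm_extract(doc: str) -> dict:
    """Stand-in for a real LLM call that pulls fields out of free text."""
    raise NotImplementedError("call your model of choice here")

def extract(doc: str) -> dict:
    # Fully structured era: pure regex, no model needed.
    if (m := V3_PATTERN.search(doc)):
        return {"status": m.group("status"), "via": "regex"}
    # Semi-structured era: regex for the known fields, LLM for the rest.
    if (m := V2_PATTERN.search(doc)):
        return {"status": m.group("status"), **llm_extract(doc), "via": "mixed"}
    # Notes-in-a-doc era: best effort with the model.
    return {**llm_extract(doc), "via": "llm"}

# Map extract() over the corpus, then reduce/merge the records downstream.
```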
Dori: Awesome. That's cool.
CL: Yeah, I think we are witnessing the birth of a new paradigm where it's almost like personal software, right? It's deterministic when it should be, and then it hands the heavy lifting off to the LLM to the extent of your confidence in the LLM, and then it's just like, yeah, go.
Bryan: Yeah, totally. And I mean it's certainly not the case that I'm just like, "eh just YOLO, like have fun."
Like, I'm building evals for the pieces of that that are LLM generated. I'm even building evals for the parts of the code that it's going to use in this deterministic way. Like, I am supervising it pretty carefully. But it is definitely one of those things where I very much don't know how I would have achieved this goal five years ago. I don't really see a clear path, five years ago, to doing this kind of task. The input data is just so mixed.
Dori: Thank you for sharing. What is one principle you wish every data team lived by?
Bryan: I think truth seeking is the number one for me. We have this assumption in the industry that everyone is truth seeking. But occasionally I'll bump into, even on data teams, people who aren't. Like, I come from math, okay? In math, there is nothing else. There is only truth seeking. That is our credo.
Dori: Literally, where's the proof?
Bryan: Yeah, exactly.
Dori: And where's my proof for the proof?
Bryan: Right, and so, entering data, I'm kind of like, oh, these are people that care about data a lot. Like, they're also very truth seeking. And for the most part, I think I've been happily surprised at how truth seeking people tend to be. But I will say, and it's a little unfair to levy this against data teams because again, it's mostly been good--
But there are people that are data team adjacent that are not truth seeking. And that has been my biggest disappointment in industry, period: the level to which people are seeking other ends and priorities. Like, people will openly talk about telling a great data story to make people happy, or telling the data story that gets you into the decision. And I'm like, no, we don't have to do that.
Data storytelling is great. I love to tell stories with data, but the data storytelling doesn't need to be synonymous with telling people what they want to hear to get yourself a promotion. If you have a successful career built on glazing people, I kind of hate you.
That's actually almost enough for me to classify you into, like, bite me. And what's shocking is people will say, without cracking a smile, like, oh yeah, a big part of being a successful data scientist is telling the stories that are important, that people want to hear, and that make the org a success. And I'm like, I don't think that has to be in contrast to truth-seekingness.
I think, candidly, there's been a big divorce in the industry between folks wanting data teams to do work for them to tell them what they want to hear, but not realizing that signing up for working with the data team means sometimes getting told things you don't want to hear. And there's a surprising conflict here.
And, you know, we've seen this sort of boom and bust of data science. Unfortunately, you hear it like once a month now: someone's like, ah, data science is dead. And I think that's absurd. But it is interesting. I think the reason we've seen this boom-bust cycle is that it sounded exciting to executives to be able to say, I've got a data team and I operate as a data-driven organization. And then they're like, oh, God, the data isn't what I thought it was going to be?
Dori: Yep.
Bryan: Like, let me give you a positive example in this category. When I was at Blue Bottle, we had this disagreement amongst two folks in leadership. One person was like, if you come into a cafe and you see very few of a pastry, that's going to make you more excited to buy that pastry. And the other executive was like, no way. If you come into a cafe and you only see one of this pastry, you're going to think it's like the bottom of the barrel. You're not going to want that pastry.
And these two executives, and these are the top executives at Blue Bottle, they disagreed. And I was like, we could probably design an experiment to test this. And they were like, that would be great. And it was hard, it was really difficult to design an experiment to test this. I mentioned it earlier, but I did. And we ran this experiment and we ran it in a controlled way.
I had a statistician on my team at the time. We went back to the literature on how you could design a real-life experiment this way. And we designed an experiment and we tested that hypothesis. And I remember the one that was wrong, and it was a little bit nuanced in terms of who was wrong. The one that was wrong was like, that's so interesting. I'm really glad that we ran this experiment.
And I just remember thinking like, that's data driven. And so later on when like, an executive would be like, oh, like, we've been running this playbook and I believe in this playbook. And I come back and I'm like, that's not working. Like, here's all the data that says it's not working. And then they throw an absolute, like, conniption fit. And their entire identity becomes trying to go and find errors in the data analysis.
Like, I think that's been the contrast that I've seen and I've been really fortunate. I've gotten to work at places like Blue Bottle and Stitch Fix, where the culture is really deep truth seeking. And you've probably heard before that Katrina Lake is kind of famous for being like, well, if the data says that this is what we should do, we should probably give it a try. And that's what I experienced at Stitch Fix is like this, like, really deep belief in like a data culture.
And even at that first startup that I joined, the one that was acquired by IBM: Michelle, the CEO. We had this technology that was selling very well, and I still remember I got some data and was like, I actually don't think this is as good as we've advertised. And she's just like, really? And she called the head of marketing into the meeting and said, we need to adjust the website. We're overselling how good this is. Until Bryan can figure out what's going on, we need to fix this.
Again, I'm right out of math grad school. I'm like, yeah, that sounds really appropriate. Only now, 15 years later am I like, what a weirdo. You know?
Dori: Haha!
Bryan: So it's great, it's great that I've gotten some opportunities to work with really, really truth-seeking individuals. That's the thing I wish was more evenly distributed. I wish every data team had that. And again, it's mostly not the data team itself, but there are some, I would say, clout-seeking individuals who will do whatever it takes.
Dori: I think it's not even clout. What I'm thinking about is, I've worked with a bunch of different PMs in my career. Before there's data, you don't have anyone telling you numbers, so you just get to go off vibes and user interviews. And then once you get numbers in there, because you're the first data hire, you're telling a lot of people a lot of uncomfortable truths.
The dark side of the data role, especially when you're working with product, is you're telling people a lot about what doesn't work, which people don't like because they haven't had that before. So I'm curious how you've helped build that trust, because sometimes that's been a barrier for me, even getting data ingested somewhere usable. Because they know once the data is somewhere, they're going to have to start being accountable in ways they hadn't necessarily been accountable before.
Bryan: Totally. It's definitely hard. It's interesting that you brought up PMs especially because I've been very fortunate. I've gotten to work with some really amazing PMs over the years and there's almost a direct correlation between the PMs that I'm thinking of when I say some amazing PMs and the ones that were like, oh, this isn't working. Oh, that's super interesting. I wonder how we could fix it? Versus the ones that are like, no, I think you're probably measuring it wrong.
CL: Data versus ego.
Bryan: Yeah. And I think ultimately that's what a lot of this comes down to is like, to what degree does the person believe that they have some unique instinct that is more powerful than the natural experiment, which is our world. To answer your question about like, how do you get better at this? I think some people would say that I have not gotten better at this.
My personal reflection on this is like, people know that I don't take myself very seriously. And so I tend to have success with approaching it with a lot of like, willingness to dive in as a follow up. And so I never say, this is the end of the story. I'm always like, okay, like, where do you think that there might be more to the story?
And so I think inviting that in has been like, more effective than, "I'm pretty sure this is the right answer. Stop doing what you're doing." But it is tough. And you know, CL said the word ego and that's ultimately a big part of this is just like how much curiosity is on the other side of this relationship.
I think curiosity is one of the most valuable assets as a data person you can possibly have. Curiosity means that when they are conflicted about your results, you are curious as to why. Saying, "you are wrong," is not curious. Saying, "why are your priors different than what the data is showing," is curious.
Dori: Mhm.
Bryan: And so--
I think trying to bring that level of curiosity to everything is aspirationally the way that you succeed here. But ultimately there will always be hurdles here and egos are a big part of it and people want to get promoted and people want to be thought of as really, really smart. And for whatever reason, some people think that being wrong is not compatible with being smart. And I think that's a bummer.
Dori: Yeah.
CL: Okay, so I guess one last question before we go into our lightning round. We've talked about a lot of the space, how AI is reshaping things, data culture, data leadership. And I think you're always at the forefront of all this, right? What is the hyped trend that you think will derail in the next three years?
Bryan: Okay, so if I'm specifically thinking about things that I think are not going to be as effective as people have hyped them up to be, I think self-serve data analytics.
Dori: Yes. I could get behind this. I can 100% get behind this. Yes.
Bryan: Oh man. I've believed in the promise. I've made the promise. I've tried to build the thing that delivers the promise. I've signed up for the promise. I just continue to be disappointed. You know, I wrote that blog post, the Hunt for a Trustworthy Data Agent. The reason I called it what I called it is because for me, like I ran that as an experiment. I didn't build the hackathon because I was like, "oh, like I want to show people why they're wrong."
And I didn't build it as a, "I think this is already here." It was, it was an experiment. I was like, okay, let's run an experiment. Let's see where we really are. And the number one thing that I came away from that thinking is:
We are nowhere close to self-serve. We aren't even in the outer rim of the same galaxy. It is crazy how far we are from self-serve data analytics.
And I don't want to talk my own book too much here, but trustworthy data agents, coming back to this truth-seeking idea, that's the point. We're trying to get to real answers for things. If all we want is answers, then yes, you can ask the sycophantic data scientist who's ready for the next promotion: How's our growth looking this month? It's looking great, buddy. Yeah, you could do that.
And similarly you can ask a Cursor agent, how's my growth? And I'm sure it'll tell you. If you just want answers, maybe it'll give you answers. But trustworthy answers, I don't see anywhere close. We're not even close on the analytics questions of what's happening.
If you can't tell me what's happening, then you can't tell me how it's happening. If you can't tell me how it's happening, you sure as hell can't tell me why it's happening.
You know, people are selling the slide from third base into home, and I'm like, that's cool, because I'm still on first base. I have yet to see anything on first base. So, I don't know. That's my very self-disappointed answer: I think we're nowhere close to self-serve.
Dori: I don't have as much experience as you, but I've absolutely come to believe it. I used to be on the self-serve side, the let's-give-it-to-everyone, let's-empower-them side. And I've come around much more to curation: how can I give people curated data, not necessarily opinionated, they should be forming their own opinions, but curated data that can help answer their questions and, more importantly, help them ask the right questions.
Because I think that's the other thing about self-serve that you didn't necessarily touch on, but that I see: we're not even asking the right questions. That's been a good chunk of my job, getting people to even ask the right question.
Bryan: Yeah, I mean, there's definitely something to, what is the right question? Because coming back to why there was a boom in data science: the allure was that we're going to ask great questions which yield great answers that change our operational behavior. People didn't want to ask, how are our growth numbers? because they were interested in some particular number.
No one cares about the number on the screen. What they care about is: Should we change our behavior?
Dori: Yeah.
Bryan: The problem is, the things you need to ask to change your behavior are not, what is our sales number? It's not, how many dollars of pipeline do we have? That doesn't change behavior. What changes behavior is the follow-up question. Okay, our pipeline is not as good as we'd like it to be. Where are we missing on pipe?
Okay, we're missing on pipe in mid-market. So we're missing in mid-market. Any particular reasons why? Okay, well, bookings are way down. Well, why are bookings way down? None of these questions stands alone; it's all about what operation can we change, what action can we take? And those require a lot of questions.
What's exciting about LLMs in the self-serve story is they're tireless. You could pepper them with questions all day. What's nuanced is that LLMs are successful at answering data questions in a very curated garden, but this stuff goes out of distribution instantly.
Coming back to the relationship between data teams and others, the most common thing I've ever experienced with great leaders who are data-driven is that I can't answer their question. There's a weird relationship where the really amazing leaders, from a data perspective, are the ones that ask me so many damn follow-ups that they eventually get to the limit of my knowledge.
They exhaust not just my patience, but my knowledge. Usually when they're asking a lot of questions, I'm not like, "I'm so tired of them asking..."
No, it's, "shit, actually. I looked at 80 things and that's a really great perspective I've never seen." I worked with one executive, Yonda. Yonda, I felt like no matter how prepared I came, there would be a moment where he would ask me something. I'm like, yeah, that is a really interesting way that we should look at this. And that is ultimately probably the way that we could take action on this. I'll get back to you.
And so what's exciting is that's inexhaustible for LLMs. What's scary is we're nowhere near the point where they can do that kind of iterative work, where they can put the signals together. Coming back to that data set: why did I build a data set that has multiple modalities? It's because, in my experience, not everything is in the SQL warehouse.
The answers to these questions are not in Snowflake. The answer is some combination of your dumbass document store, some logs that you don't know how to get and have to ask an engineer for, and some untransformed data that technically has the answer, but it's certainly not in the table. And that's why I made this hellish data set: because, I don't know, that's what I care about.
Dori: Yeah.
CL: Well, this has been awesome and exciting. Thank you so much Bryan. Before we wrap, we're going to put you in our data debug round. Quickfire questions, short answers. Are you ready?
Bryan: I'm ready.
CL: Okay. First programming language that you loved or hated.
Bryan: Smalltalk. I'm going to get canceled for that.
CL: Okay.
Bryan: That was on the love side, by the way.
CL: Okay, your go to data set you use for testing?
Bryan: Cars.
CL: Oh, what car dataset?
Bryan: There's a famous cars data set. It's a bunch of features about, you know, how big is the engine? What's the gas mileage? It's got enough variability in many of the features that I can show my students how different things break, and I can explain why you need things like normalization and so on.
Dori: If I remember correctly, RStudio just had it.
Bryan: Yeah.
Dori: Like, pre-loaded.
Bryan: Yeah.
Dori: It would just be there. That's why I was like, oh, yeah, I know exactly what you're talking about. It was just there.
Bryan: Yeah, it's way better than wine, but it's kind of similar to wine. I find it more useful than housing. And obviously, like, penguins is tiny.
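Bryan doesn't specify which cars data set or which demo he runs in class. As a rough sketch of the kind of normalization lesson he describes, here is one way to show it, assuming the seaborn "mpg" dataset as a stand-in (it ships columns similar to R's built-in mtcars that Dori mentions):

```python
# Sketch of a classroom demo: why normalization matters for a
# distance-based model on a cars dataset. Dataset choice (seaborn's
# "mpg") and model (k-nearest neighbors) are assumptions for
# illustration, not what Bryan actually teaches.
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

cars = sns.load_dataset("mpg").dropna()
X = cars[["displacement", "horsepower", "weight", "acceleration"]]
y = cars["mpg"]

# Without scaling, weight (thousands of lbs) dominates the distance
# metric and drowns out acceleration (tens of seconds).
raw = cross_val_score(KNeighborsRegressor(), X, y, cv=5).mean()
scaled = cross_val_score(
    KNeighborsRegressor(), StandardScaler().fit_transform(X), y, cv=5
).mean()
print(f"R^2 raw: {raw:.2f}  R^2 scaled: {scaled:.2f}")
```

The spread in feature scales is exactly the "variability" Bryan mentions: it makes the failure mode visible, and standardizing the features typically recovers most of the lost accuracy.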
CL: Okay, what's one lesson outside of tech that influences how you build?
Bryan: I'm going to come back to this curiosity and questions thing. I learned from a feedback class that I took that feedback is a question, it's not a statement. And by thinking about that and by finally coming to grips with that and learning how to use that, I started applying it to other things. I started asking, like, what if I could pursue everything via a question?
What if, like, everything I choose to do, I choose to speak in questions as opposed to statements? That is surprisingly effective as, like, a manager, but it's also surprisingly effective as a builder.
CL: What's one hot take about data you're willing to defend on the podcast?
Bryan:
We are in the age of unstructured data, and people are not using it enough.
CL: What's your favorite podcast or book that's not about data or tech?
Bryan: I read a lot of nonfiction. I really like the book Mosquito. It's good. It's about the history of civilization through the eyes of the mosquito.
CL: Wow. That's definitely interesting. Well, thank you so much, Bryan. One last thing is where can listeners find you? And then what do you want people to do more?
Bryan: Yeah I'm pretty easy to find on both LinkedIn and Twitter, and I now even have a Substack. So on Twitter, I'm BEBischof, and on LinkedIn I'm just Bryan Bischof. And then my Substack is called Pseudorandom Generator. And what I'd like people to do more is ask questions.
Dori: Love that. Thank you so much for joining us today. It's been an absolute treat and pleasure. I really enjoyed this conversation.
Bryan: Awesome. Thanks for having me.
CL: Thank you so much.