[Auren Hoffman] Welcome to World of DaaS, a show for data enthusiasts. I'm your host, Auren Hoffman, CEO of SafeGraph. For more conversations, videos, and transcripts, visit safegraph.com/podcasts.
Okay, hello fellow data nerds. My guest today is Hilary Mason. Hilary is the co-founder of Hidden Door and a data scientist in residence at Accel Partners. She previously co-founded Fast Forward Labs, which was acquired by Cloudera, and served as the chief scientist of bit.ly. Hilary, welcome to World of DaaS.
[Hilary Mason] Thank you so much. I'm excited for this conversation.
[Auren Hoffman] Excited as well. Okay, so you and I met 10 years ago when you were running data science at bit.ly. Now, besides, like, the tools (I know a lot of the tools have changed), what are some of the big non-obvious changes that have happened in data science since then?
[Hilary Mason] Yeah, it's really fun to think about data science on the order of decades, because we're still here, for one thing. Like, I think in this last year, with all of us staying home, we missed an opportunity to have a really big party just to celebrate that we made it this far. But also because we have a long way to go. So if you think back 10 years ago, and what's changed between then and now, you can sort of do it at different layers of the data science stack. At the bottom, we have infrastructure, and even access to resources to make things go. And of course, now we live in this world of, like, effortless scalable microservice cloud computing, whereas even 10 years ago, you had to put a lot of effort into accessing the right compute for the kind of problem you wanted to solve. The amount of friction there has essentially gone to, I wouldn't say zero, but it's dramatically declined. We have made tremendous progress.
[Auren Hoffman] Just to pull on that thread for a second. So there's this percentage of time that a data scientist is actually doing data science, and a percentage of time that they're doing, like, data munging and, you know, data ordering and stuff like that. Obviously, the percentage of time on the latter is going down. What would you say it was 10 years ago? What would you say it is today?
[Hilary Mason] You know, of course, the answer is it depends. But I would say that, and this is tied into my second point, the tooling and the commoditization of capabilities that allow you to easily clean data, as long as, you know, it looks like data that somebody has cleaned before, as well as the ability to model, both of those things have progressed dramatically. So it used to be that, like, I still sometimes reach for awk on the command line, because it was in some ways the best way to clean data quickly. Whereas today, I would hope that somebody who does not have the scars I have from, you know, 30 years of bash would pick up better and faster tools for those sorts of things. And I think if we go up the stack from tech, of course, we have a bunch of algorithmic capabilities we didn't have 10 years ago. We have lots of SDKs and APIs wrapped around those things. And that sort of changes the nature of the problems we can approach as data scientists. Like, 10 years ago at bit.ly, for example, we had a ton of images being shared. We very much wanted to know what was in those images. And you couldn't; that was an unsolved problem. And the companies that existed at the time to solve it had human beings in the background, or chose a subset of the problem that was actually tractable, like, we're gonna recognize logos in images, but not actually everything. It was that sort of thing. So that's changed dramatically. And then I think I also want to mention, before we move on, there's been a professionalization of data science, where it is no longer a novel thing. To be a professional data scientist is actually a job.
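As a sketch of the kind of quick cleaning pass Mason describes doing with awk, here is a stdlib-Python equivalent. The column names, rows, and cleaning rules are all invented for illustration:

```python
import csv
import io

# Hypothetical raw export: stray whitespace, a blank line,
# and a row with a non-numeric click count.
raw = """url , clicks
 http://bit.ly/a , 120
http://bit.ly/b,7

http://bit.ly/c, oops
"""

def clean_rows(text):
    """Strip whitespace, skip blank and malformed rows, parse counts."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        url, clicks = (cell.strip() for cell in row)
        if url == "url":
            continue  # skip the header row
        try:
            rows.append((url, int(clicks)))
        except ValueError:
            continue  # drop rows whose count doesn't parse
    return rows

print(clean_rows(raw))  # → [('http://bit.ly/a', 120), ('http://bit.ly/b', 7)]
```

An awk one-liner could do much the same, but the Python version is easier to test and extend, which is roughly the "better and faster tools" point.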
[Auren Hoffman] It's a real profession today that people have heard of, and lots of people do.
[Hilary Mason] That's right.
[Auren Hoffman] Okay, yeah.
[Hilary Mason] Now, I think if we look 10 years out, we can debate what it will look like to do that job. Like that's kind of a fun direction to go in. But yes, it is an actual job.
[Auren Hoffman] The number of data scientists today, it seems like, I don't have the numbers, but it seems like it's at least 10x what it was 10 years ago. It's gone up dramatically. How do you expect that trend to continue over the next 10 years?
[Hilary Mason] I think we're at a fork in the road; we might see two different things happen, and I don't know which one it's going to be, and it might even be both. The first one is that we just continue on this growth trajectory, and, you know, 10 years from now we have 100x the data scientists we had 10 years ago, or 1,000x. And the other way would be that the tools of data science become useful enough that people start to bring them into other professional roles. And so you might have a smaller number of people whose main profession is data science, but we would have data science being done by people whose titles might be things like product manager.
[Auren Hoffman] It seems like marketing does, the sales manager does data science today, the finance person does. It seems like everyone is doing a little bit of data science. And maybe they're only doing it today in a crude way, like in Google Sheets or Excel or something, and then we could see that kind of take off a little bit more in the future.
[Hilary Mason] Yeah, and I think personally, I believe that data skills will be a requisite skill in the executive suite and on down for everybody, 10 years from now, and they are not today. And this also leads to all kinds of mismanagement of data efforts. But, I think it might be A or B or both.
[Auren Hoffman] It's interesting, because when you think of a sales manager, a sales manager is extremely data-oriented. In fact, many times they're way more data-oriented than the engineering manager. For the engineering manager, there are a lot of soft things they need to figure out, whereas a sales manager is very hard on the numbers, the inputs and outputs they need to hit. So you wouldn't think sales is a super data-oriented thing, but in fact, it may be one of the more data-oriented places in an organization.
[Hilary Mason] Well, it is. And it's also when you have something like money that you're quantifying, and it's very easy to keep score, it becomes easier to start to become data-oriented. Where when you have something like software engineering output, that's, I mean, you can quantify it in a few different ways, and people certainly try. But yeah, it's hard.
[Auren Hoffman] It'd be really interesting if I could have invested in the stock of data science 10 years ago. Is there some sort of job today that will see 10x growth in the future, that people aren't thinking about as much?
[Hilary Mason] So I do think that there are professions that we will see in the future, with a similar growth trajectory, that are adjacent. And these are things like people who can do robust UI design around machine learning. So thinking about systems that may have probabilistic outputs, that may have error that needs to be communicated effectively to whoever the customer or the user of that system is; that requires a different set of design tools than the typical sort of UI design toolkit today. And those people already exist, and they're already out there. But I think they're not sort of recognized as a separate specialization.
[Auren Hoffman] Do you think they'll have a name, like that type of thing will have some sort of name? Because I've seen a few people who are really good at that, but I don't know that there's a way of describing their profession.
[Hilary Mason] Product design for machine learning products, maybe, something like that. Or maybe it'll just become part of the tools we expect everyone to use. And maybe we'll see the tools that people in that sort of UI design role tend to use accommodate uncertainty and design around machine-learning-type features, as those things become more common. But that's one area where I think that, if you are inclined in that design direction, but you're also really interested in machine learning and statistics and the world of probabilities and error bars and all that stuff, like, there is a great career for you to have. That's one area I would jump into.
[Auren Hoffman] One of the things that's interesting is it seemed like 10 years ago, when I met a data scientist, they were incredibly fluent in probability and statistics. They had a deep, deep knowledge of probability and statistics. It seems today that the average data science person has less knowledge of probability and statistics. Is that because there are all these tools that help you abstract some of that away? First of all, am I right? And then secondly, is it because there are all these tools, and it's less necessary to understand, you know, these deep probability types of things?
[Hilary Mason] I'm not sure I would agree with that.
[Auren Hoffman] Ok, alright good.
[Hilary Mason] And, you know, of course, the set of people I think about is different than the set you think about, right? So, you know, we could both be right or both be wrong here. I would say that a strong understanding of statistics fundamentals is critical. And it's critical because, ultimately, the work of data science is trying to make better decisions that are informed by data. Whether you're doing that in a way where a system is automating those decisions, or you're making, you know, a PowerPoint that somebody is going to read and make a decision off of, fundamentally, you're trying to make a better decision because you have some data. And yes, we have tools that allow you to start to do more interesting work with a lower bar to entry, which is great. But I also think that skill set is really important because so much of data science is about good judgment. And it's everything from, like, how do I frame the problem I'm trying to solve or the question I'm going to ask? How do I know what error metrics to use? What does good look like? How do I know when I'm done working on this? And the process of the work is so deeply creative. But also, you know, one of the common failure modes with data scientists is they get so enamored of the problem that they sort of run off and go into a hole and you don't hear from them for a year, because they're, like, way down the rabbit hole doing work. So much of our work, you get 80% of the value pretty quickly, and then you get to 81%, and then 81 and a half percent, and you have to know when to stop, like, when it's good enough for the purpose of the thing you're building. I would say that that deep understanding of basic probability and statistics, and I don't mean you have to have a PhD in it, but that you do have to be fluent with it, is one of the tools that data scientists bring to their judgment.
[Auren Hoffman] When you're starting on a problem, let's say you're at a company, it makes sense to first focus exclusively on your internal data, because there's so much low-hanging fruit. And then as you get better and better at getting those answers, you get further and further along the curve of getting good at data science, and at some point it makes sense to, like, bring in external data to help you. Is there some sort of curve that you see? Like, okay, here's the lowest-hanging fruit, and then we'll bring in some external data? Or where do you see that continuum?
[Hilary Mason] Yeah, that's a good question. And again, of course, it depends. But yes, we do tend to see these common patterns, where the first thing is really to understand the product, the business environment, and the data that already exists. Understand the provenance of that data, build systems to clean that data, to make sure that when you have 10 people looking at the same data set, they're looking at it in a similar way, if that's the intent, and to make sure that they can share work, so that if I do the work to write a query and you have a similar question, you can reuse it. And this is at the high level of things. Like, let's say we want to understand what our weekly active users are, some metric that pretty much every product has. We should define "active" the same way and "weekly" the same way. And these things sound obvious; they are not obvious. So, like, if I give you a number and you do your own work, they should match. Yeah, but they often don't. So it's just building that first layer of fluency with the data.
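To make the "define weekly the same way" point concrete, here is a toy sketch (the user IDs, dates, and both definitions are invented for illustration) in which two reasonable definitions of weekly active users disagree on the same event log:

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, event_date)
events = [
    ("alice", date(2021, 3, 1)),   # a Monday
    ("bob",   date(2021, 3, 6)),   # that Saturday
    ("carol", date(2021, 3, 8)),   # the next Monday
]

def wau_calendar_week(events, any_day):
    """Active = at least one event in the Mon-Sun calendar week of any_day."""
    monday = any_day - timedelta(days=any_day.weekday())
    sunday = monday + timedelta(days=6)
    return {u for u, d in events if monday <= d <= sunday}

def wau_trailing_7(events, as_of):
    """Active = at least one event in the trailing 7 days ending as_of."""
    start = as_of - timedelta(days=6)
    return {u for u, d in events if start <= d <= as_of}

# Same data, same reporting day, different "weekly active" answers:
print(wau_calendar_week(events, date(2021, 3, 8)))  # {'carol'}
print(wau_trailing_7(events, date(2021, 3, 8)))     # {'bob', 'carol'}
```

Both definitions are defensible; the point is that a team has to pick one, write it down, and share it, or two people's numbers will not match.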
[Auren Hoffman] Even if we're doing a churn analysis or something, we come up with very, very different answers just with slightly different definitions.
[Hilary Mason] That's right. And so you're trying to have enough empathy for the business problem that you can then go apply it to the data set, and you want people to be doing that work consistently unless there's a reason to diverge. So you have to build that base layer of, you know, knowledge of the data and how it was collected, sort of tools and APIs for interfacing with the proprietary data. And then, as you say, I find there are sort of two curves here. One is that sophistication with a company's own data and the practices around it. And then there's the sophistication with the set of problems and questions they want to ask, where sometimes you reach a point where you can take a much bigger step forward by introducing new data. Particularly, this happens when you have data collected from the world in some way, like human behavior data. Another one would be things like weather data, or environmental data. Bringing in some third-party data that reflects another dimension of the space can actually improve your modeling dramatically. And so yes, you do see this sort of growth of sophistication, and it crosses individual analyst practice, data science team practice, infrastructure, software tooling, all the way up to the sophistication of the business in wanting to even ask the right questions, or getting more sophisticated in the nuances of the questions they're interested in and exploring those.
[Auren Hoffman] The hard part then is that the data scientist really needs to understand the business. Like, truly understand the business. How can a company design the organization so that the data scientists can understand the business, and they're not just in this, like, weird pricing optimization thing that they're kind of focused on?
[Hilary Mason] So I love this question, because there are several answers, and they're all wrong. The right answer for a given organization is not necessarily the same as for another organization, and it likely changes over time. Like, when I started advising companies on data science practice, I thought, we're going to come up with the right answer, we're going to implement it, and we're done. And if it didn't work eventually, we had failed. But I've actually come to learn over time that some of the best organizations sort of swing in one direction for a while, realize they've introduced too much friction in another place, and then swing back a bit. So to your first point, empathy and interest in and understanding of product and business is definitely required for good data science work. Like, absolutely required.
[Auren Hoffman] How would you interview for that? Like, would you just try to dive into their last business and really try to figure out if they understood it, and whether they picked up some of its different nuances? How would you even know that when you talk to somebody?
[Hilary Mason] So personally, when I interview data scientists, I go at it from two directions. One is that I give a hypothetical that is vague enough that they have to ask questions to understand enough to frame a problem. And I give the same hypothetical to everyone, and I always choose a problem that everyone has an intuition for, so it's usually something around, like, shopping or something like this, so you would know what questions to ask. But I look for some really clever thoughtfulness around: how do I go from this vague, not-even-clearly-data-science statement to a clear framing of a problem? And then let's talk about techniques. And the other thing I do is exactly as you say: pick something out of the portfolio of projects they say they worked on and say, let's go really deep into this. How did you end up at this question? Because, of course, the dirty secret of data science is that you start with a question, you realize you can't solve it, you reframe it into a more achievable question, you realize you can't solve that. And so you've, like, iterated your way four questions in before you actually do a piece of useful work. So I'm trying to understand their thought processes, and looking particularly, in senior folks, for the maturity to manage that process themselves. For junior folks, I look for the interest and curiosity in asking those questions. Because honestly, when somebody comes out of an educational program and they haven't done a lot of applied work, they're often used to getting these well-formed questions with potential answers for them. Like, that's what your homework looks like. Yeah, the real world is not about doing homework. So it's really looking at their ability to form their own questions out of a bit of a messy problem space.
[Auren Hoffman] Okay, really interesting. So, you know, SafeGraph, where I work, we sell data. We sell data to data science teams, but we just sell data, like, no tools. What advice would you give to data companies who are trying to sell data to data science teams? How should they tool it? Or how should they describe their data? Or what should they be doing to sell well to data science teams?
[Hilary Mason] It is a hard business you are in, for a bunch of reasons, and a lot of people think it will be easier than it actually turns out to be. The first thing is that you have to show that your data is actually valuable. And so you do a bunch of work just to do example analyses: here are things that we could be informed by from our data, where we think we have questions that are similar to questions that you, our potential customer, might pose. Then you have to prepare for varying levels of interest and sophistication in the data asset itself. So some people will want it clean and pre-processed; they will be happy with your definition of a day. Some people will want everything as raw as possible; they'll think everything you've done is wrong, because it may be, given their, you know, stack and setup. And so you have to be prepared to have that conversation at varying levels, and potentially even provide both. And then you have the other problem, that this is a two-sided sale, because data scientists don't usually have their own budget, or if they do, it's very small. And I assume you don't want to sell your data for peanuts. So then you have to impress the data scientists so much that they go to whoever holds their budget and say, we need this. And you have to impress that person as well. So it's also a challenging sales cycle, which means you have to have both data scientists out there on your sales team, as well as sales professionals who can talk to them. And again, because data scientists don't sit in any one part of the org, they have to be able to be fluent with CEOs, CIOs, CFOs, like, heads of data science, VPs of data, digital people. They have to be pretty sophisticated.
[Auren Hoffman] Okay, well, you're depressing me, because we do have this very difficult thing.
[Hilary Mason] Well, you seem to be doing a great job.
[Auren Hoffman] It does seem, though, as you mentioned, the average data scientist today just seems way more dangerous, in a good way. Like, they're way more competent and they can do more stuff because of all these tools that are out there. How do you see that progressing over time? Is it just going to be, like you said, that the average analyst or the average marketer in 10 years is going to be able to do what the data scientist can do today? And then the data scientist of today is gonna be, like, so much more dangerous and even more capable 10 years from now? Or how's that going to change over time?
[Hilary Mason] I hope so. Um, I hope so. Because I think that the more people who don't have to talk to anyone else to use data in their workflows, the better everybody gets at the work that they're trying to do.
[Auren Hoffman] They're going to move much faster.
[Hilary Mason] Right. That said, tools always have limits. And they also tend to go wrong. And so, you know, if you were to ask me about things like AutoML, that will auto-fit a model to your data and give you a classifier: that works really well, when it works. But when it doesn't work, you still need somebody who knows what they're doing to figure out why, and what you should do instead. And so I do think we may end up where tooling sort of gives more people more capabilities. And that is, frankly, only to the good, even if they use them in somewhat bonkers ways sometimes. Like, I've had, you know, some of our clients at Fast Forward Labs who were sort of like, okay, do we have to have every analysis everyone produces run through a central data science team to approve it or say it's robust enough? I don't think you need to do that. I think that you can accept that everyone should have the chance to do the work they want to do, and that occasionally some of it won't be great, and some of it will be amazing. But then you need to have specialists who can really go deep on sort of fundamental approaches to addressing new questions, where there may not be tooling, or where the tooling falls down given the specifics of what you're trying to do. And what I mean by that is that the more generic a data science, like, pre-baked solution is, the less accurate it tends to be for your specific use case, because by genericizing it, they have lost information, or lost some of the nuances of the project.
[Auren Hoffman] That would be any ML model that's off the shelf today. Okay. Got it.
[Hilary Mason] That's right. And so that's one of the dangers where, yes, tooling will progress, and yes, everyone will be playing with this stuff. But also, you're going to need specialists who can, where it's worthwhile, actually sort of invent the approach you should be taking.
[Auren Hoffman] And how do you think about that? Because, you know, a lot of times we will run some sort of model, and you get some sort of answer, and then you start to look at it sideways, like, that just can't be true. At some point, there does need to be a human really, like, scrutinizing these answers, right? And kind of, I don't know, digging in with some other approach, like maybe a more qualitative approach.
[Hilary Mason] So I think you're getting at part of what makes a good data scientist good at their job, which is a couple of things. One is that they have that intuition, or that little spidey sense, that, like, okay, this is too good to be true, or this actually doesn't match my intuition; let's dig into this a little bit and figure out what's going on. And the other is knowing that quantitative methods are not the end-all of data science. In fact, they often tell you what, or how, but they almost never tell you why.
[Auren Hoffman] And if the world changes too, and the world is changing all the time, then they have no idea.
[Hilary Mason] Yeah, right. And so knowing when to actually go talk to somebody who might have some, you know, first-hand experience that can inform the why, or when to sort of throw up a flag and say, we're seeing this, we don't know why, but it's worth highlighting. I can give an example: one company I was talking to a few years ago had this boost in sales for two weeks every March. They had no idea why, but it was predictable, like, in the last five years of their data, and it was significant. So, you know, the data science folks working in the CFO's office were like, oh, this is cool, we can predict we're gonna get this much more revenue. And the CEO started saying, cool, what's going on? Like, maybe we should understand this. They're like, oh, we don't know. And it turned out to be one state's public school system: their budgets were released exactly on that day, and then they, you know, bought everything in bulk for the whole next year. But they didn't find that out until they talked to the sales folks who were actually on the ground, who were like, of course, we know what's going on here. So it's just putting all these pieces together and realizing that your quantitative methods can help you make the prediction (this happened in the last few years, so probably it'll happen again this year), but they don't give you the why. And I think really good data scientists and analysts are good at sort of running up that flag and saying, cool, I have to go talk to someone now to figure this out.
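The March-boost story can be illustrated with a toy sketch. The monthly revenue figures below are invented, but a simple cross-year comparison of monthly means is enough to flag the recurring spike, even though it says nothing about the why:

```python
from statistics import mean, stdev

# Hypothetical monthly revenue for 5 years; March (index 2) spikes each year.
years = [
    [100, 102, 160, 99, 101, 98, 103, 100, 97, 102, 99, 101],
    [105, 103, 170, 104, 102, 106, 101, 105, 103, 100, 104, 102],
    [110, 108, 175, 109, 111, 107, 110, 108, 112, 109, 107, 110],
    [115, 113, 185, 114, 116, 112, 115, 113, 111, 116, 114, 112],
    [120, 118, 195, 119, 121, 117, 120, 118, 122, 119, 117, 121],
]

def seasonal_outliers(years, threshold=2.0):
    """Flag months whose cross-year mean sits far above the overall mean."""
    monthly_means = [mean(y[m] for y in years) for m in range(12)]
    overall = mean(monthly_means)
    spread = stdev(monthly_means)
    return [m for m, v in enumerate(monthly_means)
            if (v - overall) / spread > threshold]

print(seasonal_outliers(years))  # → [2], i.e. March
```

This gets you as far as "March is reliably anomalous, budget for it." Finding out that it is a school system's bulk buying still takes a conversation with the sales team.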
[Auren Hoffman] You're also really interested in data ethics, and you co-wrote Ethics and Data Science with DJ Patil. And it's actually free, which is really cool; anyone can read it. What are some of, like, the less obvious takeaways that somebody dealing with data should be thinking through?
[Hilary Mason] Thank you for asking that question. And for everyone listening, you should go get our book, because it's very short and very free, and hopefully very fun. But, no, we wrote that book because we were looking at a lot of the conversations around ethics and machine learning and AI at the time (this was three years ago), and realizing that there were very few resources out there for people who, you know, have hands on keyboard and are doing the work, but want to do it well. There are a lot of philosophical resources; there are lots of critiques. But it's very hard, if you're sitting there with a question and you're doing the work, to know, like, what are the tools I can apply so that I can be fairly confident that I am doing something ethically?
[Auren Hoffman] What is an example of an ethical dilemma that someone might be going through, where there's no obvious answer?
[Hilary Mason] I have seen so many things; let me think of one that's not obvious. Let's say you're trying to decide something like which marketing segment you want to offer a certain kind of promotion to. You might be very motivated to make sure that you are not discriminating based on, you know, race or gender or other demographic aspects, but maybe, you know, you want to classify your customers in a different way. That's the kind of question somebody might be thinking about: what discounts do we want to proactively offer, and to whom? And what behaviors do we see in our customer base that we may want to understand independently of the sort of crude demographic categories? And I'm assuming in this hypothetical that we're also in an industry where we're not regulated, because the folks who are in finance or insurance generally have a pretty mature understanding of how to do this, and how to do it right.
[Auren Hoffman] An FCRA thing or some other type of thing they have to adhere to, or a HIPAA thing for healthcare.
[Hilary Mason] Yeah. But let's say you're at a telecom company, or you sell, like, hipster toilet paper, or whatever it may be, and you still want to do this fairly and correctly. Those are the sorts of questions. I've also seen plenty of issues where people are using data to make fairly significant decisions about consumer access to services, and that data may not directly inform the decision, or may have large gaps in it, and it's unclear what you do in the instance of those gaps. Like, do you try to infer missing data? Let's say, you know, you're building a service, but you're missing certain demographics from your data set entirely that may be represented in your customer population. How do you think about that stuff? With our book, we were not trying to answer these questions at all; we were trying to provide frameworks and tools that data scientists, and folks who work with data scientists, could apply to answer these questions. And we were also, being honest about it, trying to create a prop for data scientists who had these concerns to bring into their organization and say, you know, here are some people who've been doing this for a while and have some thoughts; we should have a conversation about this. Which is why it's a book.
[Auren Hoffman] When people present data, sometimes they're saying, like, X number of people agree with this, and it's sometimes hard to tease out how true it is. Because, you know, there's no way to sample everyone in the world, or everyone in a certain population, or something like that. Even the US Census misses tons of people and has lots of different flaws. How does one telegraph the biases in a data set to a data scientist, in a way that the data scientist can understand some of those biases and then do their job to account for them?
[Hilary Mason] I mean, in this sort of work, which sounds like, you know, you're doing some kind of market surveying, it happens on two sides. One is good design of the survey instruments. And then on the other side, robust reporting of error metrics. So often you'll see in the news, or in a published, you know, marketing-type report, something like, 98% of customers were not dead after using our product. So it's really doing the reporting around that so that people can understand: how many people did you survey? What was your sample? What is the denominator of that sample? How did you even come to this conclusion? And, you know, where does it fall, frankly, on the bullshit spectrum? And trying to separate that out from the work itself, and then from how it may be presented or reported as well.
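As a sketch of the kind of error reporting described here, the standard normal-approximation margin of error for a survey proportion shows why "98% of customers" means little without the sample size (the figures below are made up):

```python
import math

def margin_of_error(successes, n, z=1.96):
    """95% normal-approximation margin of error for a proportion."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)

# "98% of customers agreed" means very different things at n=50 vs n=5000:
for n in (50, 5000):
    successes = round(0.98 * n)
    moe = margin_of_error(successes, n)
    print(f"n={n}: 98% ± {moe:.1%}")
```

For proportions this close to 1 with small n, a Wilson interval would be more trustworthy than the normal approximation, but the reporting point stands: publish the denominator and the interval, not just the headline percentage.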
[Auren Hoffman] Julia Galef has this really great framework of, like, the scout mindset versus the soldier mindset. And, you know, one of the worries sometimes with data science is that it could just weaponize the soldier mindset, where it kind of allows us to prove things that we already believe. Like, how do we guard against that? Is there some sort of professional standard for data science that we can think through on that front?
[Hilary Mason] So I think it really comes down to: what does excellence in the role look like? And you're absolutely right, I think this is both pretty common and also a dangerous failure mode of the profession. But it also renders the profession useless. Because if, you know, we're just taking what everyone knows already and, you know, proving it, what's the point? Why have you invested in paying data scientists, who are pretty expensive, by the way, and all that compute and all that data? You might as well just get to the end result with none of the work. I think you see this too in the sophistication and the successful outcomes of different organizations. The ones who do this well, who have a lot of open-mindedness to change action based on what comes out of the analysis, they'll do better than the ones who are just spending money to be told what they already know, or already think they know; those will not do well. Plus, great data scientists tend to leave the latter kind of organization. So you also have this sort of polarization of talent, where when you have a team that's really good, it attracts lots of other good people, and when you have a team that's sort of in, you know, a mediocre spot, it doesn't.
[Auren Hoffman] But I mean, in some ways, all humans are, you know, we tend to want to prove things, we already believe. That that is a human trait. And maybe some people do that a little bit more than others. But all of us fall into that trap. Like, is there some sort of hack that a good data scientist could do? So they're less likely to fall into that trap?
[Hilary Mason] I mean, I think it's the scientific method. And it's the same in science. And you know, I'm both a CEO and a data scientist, so I think about this kind of thing a lot. If I'm making a decision about what we're doing in our business, is there data that would potentially change my mind or not? If not, I'm not going to bother doing the work. Or is this a big enough decision to even justify that work, which we haven't talked about at all, by the way? There are plenty of problems out there where data could inform and potentially improve the outcome, but it's actually not worth it for a bunch of reasons. So it really is keeping as much as you can of that intellectual rigor around what is fundamentally the scientific method: I have a hypothesis, I agree that this is a way to frame the question and get an answer with robust error metrics that can either confirm what I already think or not, and then I'm going to do the work and go from there. And for data scientists, I find that the questions you can come up with where you don't have a hypothesis are actually the ones that tend to be more meaty and fruitful in the end, because a lot of stuff about most businesses is pretty obvious. Thinking back to social media data, this might be things like, people post on social media more when they're awake than when they're asleep. Yes. These are the kinds of questions you can end up asking and validating, and maybe that's not so useful to spend the effort on. But maybe ask questions like, are different topics posted on social media at different times of day? And it turns out, yes. Sports, at least a decade ago, ended up having a very predictable pattern, whereas generic news happens, you know, all the time, celebrity gossip all the time. So, you know, it's really trying to dig into those in a way that's honest and useful.
But also, you know, if you end up finding yourself asking questions where the answer is not gonna change what you do anyway, like don't do the work.
[Auren Hoffman] Back to interviewing, because I know a lot of people are really thinking about how to interview well for a data scientist, and I heard a talk where you mentioned that companies really should do a better job of screening for ethics. When hiring data scientists like how do you do that? Like, I could see if you worked with somebody for a while, but how do you do the interview process?
[Hilary Mason] Again, try to come at it from a bunch of directions. So ask people things like, you know, tell me about a time when you disagreed with a colleague, and how did you resolve that? And then I also like to put a few questions in our technical interviews around things like, let's say you're building a classifier to, and we'll make it pretty blunt, decide who's going to get a loan and who's not, or how big the loan will be. And it turns out that based on historical data, race is the biggest factor in, you know, accurately deciding who's going to pay their loan back and how much money you should lend them. What do you do? And the right answer here is: I take race out of the model, I accept the lower accuracy. The wrong answer is: well, I take race out of the model, but I put zip code in, because it correlates with race, and I just know that, and everybody knows that. Right? So it's looking for whether they even blink at something like that. Are you meeting the letter of the law, or are you meeting the sort of spirit and values behind the question? And you would really be surprised, probably, by how many people don't even blink and assume accuracy is all they're there to optimize for, when, in my view, it is not.
[Auren Hoffman] Because for many data scientists, in some ways, like you've been trained to optimize for accuracy ever since, you know, ninth grade math or something. And now you need to take some other like more important societal things into account as well.
[Hilary Mason] And I also like to ask questions around, you know, whatever product we're building: how do you think this impacts people? What are the things we should be worried about? So again, it's asking multiple questions to just sort of understand how they think about the role of their work in the world. And I clearly sit in a position where I think data scientists should bring pretty strong values to the work, because if they don't, the potential for harm is great, and that organizations exist as a structure to support and buffer people expressing those values.
[Auren Hoffman] I would love to dive in a couple personal questions. So first of all, like your work on this new thing, Hidden Door, it's really cool. Like, can you just tell us a little bit more about it?
[Hilary Mason] Sure. So at Hidden Door, we are building a social role playing game for tweens and teens to essentially take the kinds of storytelling play you do in Dungeons and Dragons, and in, say, writing fanfiction with your friends, and to do it online. And we're doing it because storytelling is the kind of play we have all done forever. But for the first time, we have natural language processing technology that can actually facilitate and enhance this kind of play. And so we're building tools that allow people to take the stories they have in their minds, but they may not be able to express yet in language and sort of play with those stories with their friends. So it's something that..
[Auren Hoffman] Hold on. We need to double click on this first thing because when I was a tween, I was a huge Dungeons and Dragons person. Probably not surprising to a lot of people. You too? Okay, great. To me, that seems like the last thing like computers could help with. How does the computer aid in that or what how does it work on that? By the way, why are you just doing this for tweens? Like, I want to play this.
[Hilary Mason] Well, that's a separate question. Thinking about the role of somebody who runs a tabletop RPG, so we'll call them the game guide, they do a bunch of things.
[Auren Hoffman] The dungeon master.
[Hilary Mason] Right, they progress the story, they set goals, they enforce the laws of physics. So you might say, you know, my character picks up this teacup and throws it at you, and I'll say, cool, you can do that, but roll to see what the damage is. Whereas if I say, you know, I pick up this building and throw it at you, you're gonna say, okay, no, you can't do that. Right. This is the kind of capability where we firmly believe that systems are not creative. Even deep learning systems that have ingested millions of stories and role-playing sessions are not creative. What they are very good at is providing a palette of options, where the players are able to say, I think this sort of thing should happen next, the system fills in details, and then they're like, oh no, actually, it should go over this way. And by the way, we're doing this together with friends, so we're talking about it, we're collaborating. This is ultimately a co-op experience where we're telling the story together. So systems are not writing compelling novels, and you can go to the internet and see that there are millions of stories out there. We don't need more stories. What we need are tools that help us tell our stories, more easily, more socially. And that's where we're at with this bit of tech. Why tweens and teens? Frankly, because it's where a lot of the storytelling play happens. It's where folks are starting to experiment with aspects of identity, or, you know, using these games as tools for saying, what if I was a really brawny sort of person? Or what if I was the scholar who's really into this, like, obscure kind of magic, right? Or, you know, it's not even all a fantasy setting, right? What if I'm in a boarding school, and I'm babysitting for a friend, and we're having a relationship issue?
Like, it's a way it's a time in people's lives where the ability to try on an identity and roleplay it is something that is incredibly powerful. And I am personally very invested in building the tools to allow for that.
[Auren Hoffman] Now, you have 126,000 followers on Twitter last time I checked. You’re hmason on Twitter. When you first started tweeting, did you have any idea there was such a big market for data nerds? If someone else is a data nerd and wants to increase their public profile, how does one go about doing that?
[Hilary Mason] So to answer the question: when I started tweeting, Twitter was not that big. So no, I had no idea.
[Auren Hoffman] It may have been less than 126,000 people on Twitter.
[Hilary Mason] Yeah, there definitely were. It was also the kind of thing where, if you go back and look at my stuff from 2007, it’s like, I made a bunch of cookies. It’s a lot of, why are zip codes so weird? It’s a lot of that kind of stuff. As for people who want to have a public profile, first I would say think about whether you really want to, because it’s not always a positive. Basically you get people retelling your own jokes back to you, but dumber, every time you say anything. But the other side of it is, I meet so many interesting people and have so many random connections because of it that I just can’t give it up. It’s too much fun. And then for people who do want a public profile, it involves having opinions on things. That’s not to say hot takes on the news of the day, but really having a point of view on a practice that people are interested in, and then sharing that point of view where it’s interesting and helpful. Of all the stuff that I share, not all of it is about data. I am a huge New York nerd, so there’s a fair bit of New York stuff in there. A fair bit of history, of tech especially, and art. And then, about data, the things people seem to respond to are the ones where you state a point of view on something clearly, so people can build off of it and develop their own points of view. It’s best if it’s based on experience or excitement. The other thing about having a lot of Twitter followers that I really try to do is highlight interesting work done by people who do not have 126,000 Twitter followers, so they can get some of that attention. Particularly lots of diverse folks who do really interesting data science stuff, and whatever they’re super nerdy about, I like to pull that out and highlight it.
[Auren Hoffman] Oh cool. Last question we ask all of our guests: if you could go back in time, what advice do you wish someone had given to your younger self? Let’s say Hilary at 18.
[Hilary Mason] At 18, I would say: when a cute person asks you out for coffee, you don’t have to drink the coffee, and you should go. Because I was sort of an idiot.
[Auren Hoffman] Woah, wait. Let’s go back. What do you mean by that?
[Hilary Mason] I probably missed out on a lot of social opportunities because people would say let’s get a coffee and I would say I don’t drink coffee.
[Auren Hoffman] Oh, oh, got it. You didn’t get the next level. Ok.
[Hilary Mason] Yeah, so the advice I would give myself back in time is that I have been extraordinarily lucky, and a lot of that luck has been a function of pursuing the things that I was already interested in and excited about, and then meeting the people who were also excited about those things, without being ashamed of them. I remember a time, along the lines of the coffee thing, when telling someone at a party that I was working on machine learning was enough for them to be like, okay, I’m going over there. Nobody thought it was cool. It was not cool. But I was interested in it, and that was enough to keep that passion and excitement. And I still have that. I would say find the things that you’re interested in. Find your people, because they are out there. That might involve going somewhere like New York or San Francisco, or online somewhere where people gather. Find your people. And just pursue those interests because they’re interesting to you, and that will give you, whether it becomes your job or not, enough interesting and valuable perspective on something that you will be able to apply it to other areas in a way that will be really useful and interesting. In my own career, I have been a data scientist, I’ve run data science teams, I’ve been a CEO, I’ve built products. But having that deep interest in machine learning and data science and computer science and things like operating systems, even though it’s pretty specific and nerdy, has become very useful in these other domains as well. So I’d just say find that thing you’re super into, find your people, and then have fun with it. Never lose the fun.
[Auren Hoffman] Well, this is great advice. Thank you very much. Besides following you at @hmason on Twitter, where else can people learn more about you?
[Hilary Mason] Yes. So our company Hidden Door is at hiddendoor.co, and you can find more about me at hilarymason.com.
[Auren Hoffman] Awesome, thank you so much for being with us on World of DaaS.
[Hilary Mason] Thank you. This has been a lot of fun.
Hilary Mason, co-founder of Hidden Door and data scientist in residence at Accel Partners, talks with World of DaaS host Auren Hoffman. Hilary previously co-founded Fast Forward Labs, which was acquired by Cloudera, and served as the Chief Scientist at bit.ly. Auren and Hilary explore how data science has progressed in the past decade, the role of data science in an organization, and data ethics.