It’s been eight years since DJ Patil — then the data and analytics lead at LinkedIn — helped coin the term “data scientist,” and the profession has already become one of the most popular in the country.
Patil has long been involved in the data industry. As a doctoral student and later faculty member at the University of Maryland, he used open datasets from NOAA to help improve numerical weather forecasting. He was the director of strategy, analytics, and products at eBay and later spent nearly three years at LinkedIn. He’s written books on the culture of data and building data products.
Last year, the White House appointed Patil as its first chief data scientist and deputy chief technology officer for data policy in the Office of Science and Technology Policy.
We chatted with Patil about what got him interested in data, what being a “data scientist” means and where he sees the industry going.
How did you first get started working with data?
I really had this moment of deciding to actually learn it, and also to impress my girlfriend. I kind of picked it up really quickly, and I fell in love with math.
From there I transferred to UCSD, where I started really working on a lot of data aspects around chaos theory. From there, I went to the University of Maryland, the home of chaos theory, and one of my advisors was Jim Yorke, who coined the term "chaos theory."
We started working on weather forecasting. We really stumbled across the idea that weather was not as chaotic as people had previously believed. The way we did that was by me going in every night at around 8, 9 p.m., taking over every computer in the math department secretly, and then downloading all of this data from the National Weather Service, ripping it apart, putting it together in different ways — and then leaving before 8 a.m., when anybody would come in.
And that allowed us to find these really interesting patterns. That was an "a-ha!" moment for me. I realized, "Oh wow, you can do really incredible things if you’re able to go get data." After we did that, that became one of the major techniques used in weather forecasting.
You then helped to coin the term ‘data scientist’ (with Jeff Hammerbacher, then the data manager at Facebook), right?
Yes. It’s good and bad. I think there’s this interesting question of, "Well, what is a data scientist? Isn’t that just a scientist? Don’t scientists just use data?" So what does that term even mean?
You’ve had one of my co-authors, Hilary Mason, on the show, and the thing we joke about and we wrote about together, is that the number one thing about data scientists’ job description is that it’s amorphous. There’s no specific thing that you do; the work kind of embodies all these different things. You do whatever you need to do to solve a problem.
If you’re building a self-driving car, who are those people who are building the self-driving car? They’re data scientists — whether they’re product managers, designers, whatever they are. They’re the people who are using these techniques and ideas from economics, from statistics, from machine-learning, from artificial intelligence, from all these disciplines to specifically make it work, to make the car drive in a way that keeps you safe and others safe as well.
The best data scientists have one thing in common: unbelievable curiosity.
How has the data industry changed, and why do you think it’s become popular to be a data scientist?
I think the reason the data science aspect has really blossomed now is, one, people are able to collect data far more easily than before; it’s not a lot of effort to do it. The second is, now that people can collect a sufficient amount of data, there’s this question of, okay, so what are we supposed to do with it? And who’s actually going to do this?
How do you think the White House came to realize it needed a chief data scientist?
Well, one of the things that people haven’t always really taken into consideration is how much focus this president has put on data from day one. Even if you step back in his campaign, he’s very focused on using data in novel ways to engage with the public. Coming into the administration, he’s been focused on everything from how do patients get more access to data, to how do we ensure that we’re using data for transparency — increasing the amount of data that’s open out there.
We’ve created data.gov, where there are almost 200,000 datasets that are available for everyone to look at. How do we use data to improve services for everybody? In fact, [President Obama] has an executive order that all governmental data by default is open and machine-readable, and that data that is published using federal research dollars should be free, because who paid for it? The taxpayers. (There’s a time window where we want the [health] journals to be able to have exclusive access, but over the long term, the public shouldn’t have to pay for that.)
Just like he was the first president to have a chief technology officer, he’s recognized that there needs to be a team that is focused on how do we unleash the power of data to really benefit every single American.
You’ve now held this position for over a year. What’s your proudest achievement so far?
The achievement I’m most proud of so far is that data scientists are now heavily, heavily engaged in working on these problems, and so many of the federal agencies now have a data team or a chief data scientist or a chief data officer. Take transportation, for example. They have a chief data officer who’s focused on, how does the Department of Transportation think in a novel way about this? The National Institutes of Health have a person who is focused on new ways of thinking about data. So does the US Department of Agriculture. Even USAID. So everyone is thinking about data as a force multiplier.
Where do you see the future of the data industry going?
The most exciting thing for me about the future is how data is going to be part of every single conversation, and that we will make faster, higher-quality decisions as a result of that. What will happen is, we won’t just look at data once every 10 years to evaluate something — we’ll be looking at data very regularly and course-correcting in much more real time.
And that will allow us to have government provide better services and be more agile.
What advice do you have for someone who wants to become a data scientist?
There has never been a better time to start. Just go to data.gov. There are nearly 200,000 datasets there; just start downloading them and play with them. One of the coolest things that you can do now is work with data at your local city level.
There’s a National Day of Civic Hacking [on June 4], and what’s going to happen on that day around the entire country is, people are going to have a hackathon in their local town, they’re going to work on data at the local level. They get to use that data to improve their local communities.
What do you think are some of the biggest challenges facing the data industry?
Something that I think is really important, that I called for, is every single training program — whether it’s undergraduate, graduate, or online courses in data science — must have data ethics as not some elective, but as a central tenet of how we do things.
When we do work with data, you have incredible opportunities to do great things with it, and you also have the ability to do something that could be very problematic. We’re seeing where people have used data in ways that we think are fundamentally not OK. People have started to talk about this and what we should do about it. I think we have to have a much stronger conversation. Privacy components are equally important.
I also think we have to train a lot more people to use data. "Use data" means how to read a graph at the very basic level, all the way to doing very sophisticated things. Empowering people with data in their daily lives gets people to be in better control of their destiny. That could be something as simple as, "How do you choose college?" That’s why we work so hard with the Department of Education to build the College Scorecard, which gives people transparency in a novel way.
Do you ever get any backlash in your role?
The biggest backlash I think there is is how do we manage the privacy aspect of this, and how do we simultaneously think about cybersecurity? The reason I don’t think backlash is quite the right word is because everyone recognizes the value here, so it’s not a "but" — it’s an "and." How do we use data and preserve privacy and ensure cybersecurity? I haven’t gotten anybody who’s angry at the problems we’re working on; I think what we have as a problem is, "Why aren’t you working on that?" Maybe that’s the biggest backlash.
So how do you deal with those concerns about privacy and cybersecurity?
I’m very focused on them. In fact, they’re integrated in everything we’ve done. For example, in the Precision Medicine Initiative, we released privacy and trust principles that, we believe, are going to be the app going forward for anybody who is doing this kind of biomedical research. And then we released the draft security framework for any of this type of research going forward, and we’ll be finalizing that very shortly. So, we practice what we preach, in that data ethics is an incredible component of every single thing we do.