In early November, Strava — a technology company that tracks athletic activity for its users through its website and app — published an article on Medium proudly announcing its first major global “heat map” since 2015.
The article lists some astounding numbers: one billion activities, three trillion latitude/longitude points, 13 trillion rasterized pixels, 10 terabytes of raw input data and 17 billion miles covered.
Then an Australian college student, Nathan Ruser, pointed out on Twitter that he could make out the locations of several undisclosed US bases in the Middle East. That one tweet sent the internet — and the US government — into a tailspin, and raised questions about how seemingly routine data points (such as geotags on social media apps) can be used to draw inferences about users’ private worlds.
It turns out there is a big difference between the data we think we are emitting into the digital universe and what is actually collected with each post, picture or GPS data point, says Gavin Sheridan, CEO and co-founder of Vizlegal and an open-source intelligence specialist.
“People will usually — whether it's a Google satellite image or a Strava map — zoom in on places that they're familiar with, so they look at where they live and they look at places that they've run themselves and they'll see what other people are doing,” Sheridan says. “But the problem with that is essentially that they publish it for the whole world.”
Zeynep Tufekci, an associate professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, says the biggest issue is not knowing which data points a third party will use in conjunction with one another.
"I think this is a really interesting case that shows we're all to blame and nobody is to blame, because it's very hard to predict what any piece of data will reveal — not on its own but when it's joined with all sorts of other data,” Tufekci says.
There could be reasonable use cases for sharing that data, such as learning about a new running or biking trail, she says, but other inadvertent revelations (such as the location of a military base) could have unintended consequences.
“It shows more than what anybody bargained for,” Tufekci says of the combined data points being collected from social media and tracking software.
The technology companies themselves did not know what they were getting into, either, as advances in algorithms and artificial intelligence continue to be released into the digital ether.
"This is our problem: Our privacy protections depend on this alleged informed consent, but the companies are in no position to inform us because they don't know what the data is going to be doing out in the world,” Tufekci says. “So we're not in any position to consent to what we cannot comprehend is going to happen.”
Sheridan often cites the example of employees taking geotagged selfies and posting them on their first day at a new job, a common practice. Many social media platforms expose their data through application programming interfaces (APIs). Those knowledgeable in the field, like Sheridan, can extract information about the users posting such photos — and even glean intelligence about the behavior of that particular company at that moment, all from a single image.
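The location metadata in such a photo takes little effort to read. As a rough illustration (the tag names below follow the EXIF standard used by smartphone cameras, but the sample coordinates are invented), this sketch converts the degrees/minutes/seconds GPS fields embedded in a geotagged image into plain decimal coordinates:

```python
# Illustrative sketch: converting EXIF-style GPS metadata (degrees,
# minutes, seconds plus a hemisphere reference) into decimal coordinates.
# Tag names mirror the EXIF standard; the sample values are invented.

def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert a degrees/minutes/seconds triple to signed decimal degrees."""
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative in decimal notation.
    return -decimal if ref in ("S", "W") else decimal

# Invented metadata resembling what a geotagged selfie carries.
exif_gps = {
    "GPSLatitude": (38.0, 53.0, 23.0), "GPSLatitudeRef": "N",
    "GPSLongitude": (77.0, 0.0, 32.0), "GPSLongitudeRef": "W",
}

lat = dms_to_decimal(*exif_gps["GPSLatitude"], exif_gps["GPSLatitudeRef"])
lon = dms_to_decimal(*exif_gps["GPSLongitude"], exif_gps["GPSLongitudeRef"])
print(f"Photo taken near {lat:.4f}, {lon:.4f}")
```

A pair of numbers like these, dropped into any mapping service, resolves to a street address — which is why a single first-day selfie can place an employee, and their employer, on the map.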
"I could look through who they follow and who follows them, look at the interconnectedness of their social graph, and figure out something about that person: who they are, what they're interested in, how old they are, without that person necessarily believing that information could be derived from what they shared,” Sheridan says. “They don't necessarily understand that what they think is relatively inane or innocuous information can be extrapolated.”
This ability only scratches the surface of what some machines can infer using certain algorithms, Tufekci says. She and her team have published research on how artificial intelligence, from Facebook likes alone, can reliably infer a user’s race, gender and sexual orientation — and more obscure information, including a person’s likelihood of substance abuse or depression.
“When you put your data out in the world, it's not just the data you're putting out,” she says. “You're letting machine intelligence and algorithms churn through it and figure things out about you.”
At the moment, most of those algorithms are run to target consumers directly based on their needs and to get users to click on sites. The purpose of such technology is starting to expand, though, into deep vetting of prospective employees, with some companies gathering data on the likelihood that a candidate is, for instance, prone to unionizing, likely to become pregnant, or living with an illness or other serious health issue.
“I think this is a moment in which there has to be a real good reason for data to be stored just because we don't have a handle on what's going on,” Tufekci says.
Both Tufekci and Sheridan believe there are safer ways to collect data from users while still protecting individual identities through techniques such as encryption.
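One such safer-collection technique (offered here as an illustrative sketch, not either expert's specific proposal) is differential privacy: a service reports only noisy aggregates, so population-level trends survive while no individual's contribution can be confirmed from the output:

```python
# Illustrative sketch of a differentially private count: report an
# aggregate (say, how many users ran a given trail) with calibrated
# Laplace noise, masking any single user's presence in the data.
import math
import random

def laplace_noise(scale, rng):
    """Sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon=1.0, rng=random):
    """Report a count with Laplace(1/epsilon) noise (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
reports = [noisy_count(500, epsilon=1.0, rng=rng) for _ in range(1000)]
# Each report is perturbed, but the aggregate trend survives:
print(sum(reports) / len(reports))  # close to 500
```

Smaller values of `epsilon` add more noise and stronger privacy at the cost of accuracy, which is exactly the trade-off between useful services and individual protection that both interviewees raise.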
“The problem is at the moment Silicon Valley is basically minting money with the current collect-everything-and-do-whatever-you-want model,” Tufekci says, “so they're not really incentivized to provide us with these services.”
Adds Sheridan: "It's like we're in year zero of social media, where we think it's been around for a long time but actually we're at the very start. And I think one thing is how do we interrogate the platforms that we're using, to oblige them perhaps to tell us what they know about us. … I think the [big] question for the next five years is how the platforms will be proactive about telling us what they are doing in real time, not just retrospectively.”