The ongoing IRS scandal, in which various groups were targeted according to keywords such as “tea party” in the search for infractions, has important lessons for emerging big data techniques. Because analytics based on huge amounts of streaming social data linked with demographics provide the possibility of creating social profiles, a door has been opened for new types of abuse that may create invisible legal issues. Profiling has always been problematic, but it has generally been overt, and the result of a conscious decision. Invisible social profiles, on the other hand, might not even be fully understood by the analyst.
The IRS example is fairly primitive, and is at least partly a response to reduced funding and fewer personnel, which led to a decision to make greater use of analytics. Keyword analysis is rather simple, and it was not applied in a very refined way. But it did create political issues and an appearance of impropriety. And that was just identifying political affiliation. The Chicago Police Department uses analytics to profile “potential criminals.” This verges on the territory of the movie Minority Report, in which people are punished for crimes they are predicted to commit.
Profiling is essential for prediction, and the profile is a key part of the modeling process that goes into the discovery phase of analytics. As discovery becomes faster and more autonomous, handling gigantic quantities of social data, profiles will be created with behavioral probabilities attached. These profiles will feed predictions, which might lead to autonomous action. But what if the key identifier of the profile is race, color, or religion? What if it is political belief, socioeconomic status, or education? Since actions are taken against a profile to modify or prevent behaviors, the profile could also be self-reinforcing.
Profiling abuse is an important issue for big data as it turns to complex social streams, but it is not the only one. Another issue, previously understood but now becoming much more problematic, is the ability to identify individuals in “anonymous” data streams. There is a considerable amount of information about people available in the ether, and the streams from social media are filling in a tremendous amount of detail. People express their hopes and desires, talk to their friends, and discuss their expectations. This can be linked with demographic information, census information, and other data to profile a single individual (or a range of single individuals) and model their behavior, or to target them for observation in a manner that exceeds the permissions they thought they were giving for the use of their data and the privacy they expected.
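To make the linkage risk concrete, here is a minimal sketch of how an “anonymous” stream might be re-identified by joining it to demographic records on shared attributes. Everything in it is hypothetical and chosen only for illustration: the records, the field names (zip, birth_year, gender), and the reidentify function are assumptions, not a description of any real system.

```python
from collections import defaultdict

# Hypothetical attributes that appear in both data sets (an assumption).
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

# Hypothetical "anonymous" social stream: names removed, attributes kept.
anonymous_posts = [
    {"zip": "60614", "birth_year": 1980, "gender": "F",
     "text": "hoping for a new job"},
    {"zip": "60614", "birth_year": 1975, "gender": "M",
     "text": "planning a trip"},
]

# Hypothetical demographic records that do carry names.
public_records = [
    {"name": "Person A", "zip": "60614", "birth_year": 1980, "gender": "F"},
    {"name": "Person B", "zip": "60614", "birth_year": 1975, "gender": "M"},
    {"name": "Person C", "zip": "60614", "birth_year": 1975, "gender": "M"},
]


def reidentify(anonymous_rows, identified_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, text) pairs where the shared attributes match exactly one person."""
    # Index the named records by their combination of shared attributes.
    index = defaultdict(list)
    for row in identified_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])

    matches = []
    for row in anonymous_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # A single candidate means the "anonymous" row points at one individual.
        if len(candidates) == 1:
            matches.append((candidates[0], row["text"]))
    return matches


if __name__ == "__main__":
    for name, text in reidentify(anonymous_posts, public_records):
        print(name, "->", text)
```

In this toy data the first post is attributed to one person because only one record shares its attributes, while the second stays ambiguous because two records share them. The more detail a stream carries, the fewer people share any given combination.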
There are many related issues, of course, far more than can be covered here. But it is increasingly urgent that companies examine their own use of data and develop policies for usage that do not create illegal profiles or transcend the boundaries of legitimate use for personally identifiable data (PID).
Welcome to the era of ever deeper learning, and the beginning of a new Age of Discovery!