A killer AI has gone on a rampage through Pakistan, slaughtering perhaps thousands of people. At least that’s the impression you’d get if you read this report from Ars Technica (based on NSA documents leaked by The Intercept), which claims that a machine learning algorithm guiding U.S. drones – unfortunately named ‘SKYNET’ – could have wrongly targeted numerous innocent civilians.
Let’s start with the facts. For the last decade or so, the United States has used unmanned drones to attack militants in Pakistan. The number of kills is unknown, but estimates start at over a thousand and range up to maybe four thousand. A key problem for the intelligence services is finding the right people to kill, since the militants are mixed in with the general population and not just sitting in camp together waiting to be bombed.
One thing they have is data, which apparently includes metadata from 55 million mobile phone users in Pakistan. For each user they could see which cell towers were pinged, how they moved, who they called, who called them, how long they spent on calls, when phones were switched off, and any of several dozen other statistics. That opened up a possible route for machine learning, neatly summarised on slide 2 of this deck. If we know that some of these 55 million people are couriers, can an algorithm find patterns in their behaviour and spot others who act in a similar way?
What exactly is a ‘courier’ anyway? This is important to understanding some of the errors that The Intercept and Ars Technica made. Courier isn’t a synonym for ‘terrorist’ as such – it means a specific kind of agent. Terrorist groups are justifiably nervous about using digital communications, and so a lot of messages are still delivered by hand, by couriers. Bin Laden made extensive use of couriers to pass information around, and it was through one of them – Abu Ahmed al-Kuwaiti (an alias) – that he was eventually found.
That’s who the AI was being trained to detect – not the bin Ladens but the al-Kuwaitis. Not the targets so much as the people who might lead agents to them. Ars Technica implies that somehow the output of this courier detection method was used directly to “generate the final kill list” for drone strikes, but there’s zero evidence I can see that this was ever the case, and it would make almost no sense given what the algorithm was actually looking for - you don’t blow up your leads.
How did it work? The NSA tried several classification algorithms, and chose what’s known as a random forest approach. It’s actually pretty simple to describe. You have 55 million records, each with 80 different variables or ‘features’ in them. A random forest algorithm splits this data up into lots of random overlapping bundles of records and features. So you might end up with bundles like these:
- Batch 1: ‘average call length’ and ‘number of cell towers visited’ for a million randomly selected people.
- Batch 2: ‘daily incoming voice minutes’ and ‘daily outgoing minutes’ for another million randomly selected people.
- […lots more batches…]
- Batch N: ‘number of cell towers visited’ and ‘daily outgoing minutes’ and ‘age of person’ for another million randomly selected people.
And so on. The next step is to train a decision tree on each bundle of data. A decision tree is, very crudely speaking, an algorithm that takes a record with a number of variables and goes through a series of yes/no questions to reach an answer. So for example, ‘if this variable is > x1 and that variable is not > x2 and ‘a third variable’ is > x3…’ (...and so on for perhaps dozens of steps...) ‘…then this record is a courier.’ The exact values for all the ‘x’s used are learned by training the algorithm on some test data where the outcomes are known, and you can think of them collectively as a model of the real world.
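To make that concrete, here’s a minimal sketch in Python of what a single trained tree might boil down to. The feature names and thresholds are invented purely for illustration – a real tree would learn dozens of splits like these from labelled training data rather than having them written out by hand.

```python
def toy_courier_tree(record):
    """A hand-written stand-in for one learned decision tree.
    Feature names and threshold values are invented for illustration only."""
    if record["avg_call_length_secs"] > 45:                  # 'this variable is > x1'
        if not record["cell_towers_visited_per_day"] > 6:    # 'that variable is not > x2'
            if record["daily_outgoing_minutes"] > 30:        # 'a third variable is > x3'
                return "courier"
    return "not a courier"

# An invented record to run through the tree.
example = {
    "avg_call_length_secs": 60,
    "cell_towers_visited_per_day": 4,
    "daily_outgoing_minutes": 50,
}
print(toy_courier_tree(example))  # -> "courier"
```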
Having created all those trees, you then bring them together to create your metaphorical forest. You run every single tree on each record, and combine the results from all of them to get some probability that the record is a courier. Very broadly speaking, the more the trees agree, the higher the probability is. Obviously this is a really simplified explanation, but hopefully it’s enough to show that we’re not talking about a mysterious black box here.
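As a rough sketch of the whole pipeline – bundling, training and voting – here’s how it might look with scikit-learn’s off-the-shelf RandomForestClassifier. The slides don’t say what implementation the NSA actually used, and all of the data and numbers below are placeholders; the point is just that bootstrap sampling and random feature subsets give each tree its own bundle, and predict_proba reports, roughly, the share of trees voting ‘courier’.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Placeholder data: 100,000 'people' with 80 behavioural features each,
# and only a handful labelled as known couriers (the positive class).
X = rng.normal(size=(100_000, 80))
y = np.zeros(100_000, dtype=int)
y[:50] = 1        # the few confirmed couriers
X[:50] += 0.5     # give them a slightly distinctive profile

forest = RandomForestClassifier(
    n_estimators=200,         # number of trees in the forest
    max_features="sqrt",      # each split sees only a random subset of features
    bootstrap=True,           # each tree sees a random sample of records
    class_weight="balanced",  # crude compensation for the extreme class imbalance
    n_jobs=-1,
    random_state=0,
).fit(X, y)

# Score a record: loosely, the fraction of trees that call it a courier.
print(forest.predict_proba(X[:1])[0, 1])
```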
How well did the algorithm do? Both The Intercept and Ars Technica leapt on the fact that the person in the data with the highest probability of being a courier was Ahmad Zaidan, a bureau chief for Al-Jazeera in Islamabad. Cue snorts of derision from Ars Technica:
“As The Intercept reported, Zaidan frequently travels to regions with known terrorist activity in order to interview insurgents and report the news. But rather than questioning the machine learning that produced such a bizarre result, the NSA engineers behind the algorithm instead trumpeted Zaidan as an example of a SKYNET success in their in-house presentation, including a slide that labelled Zaidan as a ‘MEMBER OF AL-QA’IDA.’”
If you knew nothing about machine learning, or you ignored the goals the algorithm was actually set, it might seem like a bad result. Actually it isn’t. Let’s ignore the NSA’s prior beliefs about the man. The algorithm was trained to look for ‘couriers’, people who carry messages to and from Al Qaida members. As a journalist, Zaidan was so well connected with Al Qaida members that he interviewed Bin Laden on at least two occasions. This was a man who regularly travelled to, spoke with and carried messages from Al Qaida members.
If the purpose of the algorithm had been narrowly to ‘detect terrorists’ or ‘identify suicide bombers’ then The Intercept might have a point. But it wasn’t. It was trained to find people tightly linked to Al Qaida who might be carrying useful intelligence. Its identification of Zaidan – regardless of whether he was acting as a journalist or not – was entirely correct within the context of those goals.
(As an aside, obviously I’m not making any moral statement here about the validity of intelligence agencies tracking journalists and intercepting their communications. I’m talking simply about the performance of the algorithm in carrying out the objectives it was set.)
So the one case that The Intercept and Ars Technica highlight as a failure of the algorithm is actually a pretty striking success story. Zaidan is exactly the kind of person the NSA would expect and want the algorithm to highlight. Of course it’s just one example though, so how well did the algorithm perform over the rest of the data?
The answer is: actually pretty well. The challenge here is enormous because while the NSA has data on millions of people, only a tiny handful of them are confirmed couriers. With so few positive examples, it’s hard to create a balanced set of data to train an algorithm on – an AI could just classify everyone as innocent and still claim to be over 99.99% accurate. A machine learning algorithm’s basic job is to build a model of the world it sees, and when you have so few examples to learn from it can be a very cloudy view.
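To see why raw accuracy is such a misleading yardstick here, a quick illustration with made-up numbers: a ‘classifier’ that simply labels everyone innocent scores almost perfectly while catching nobody.

```python
population = 55_000_000   # people with phone metadata in the dataset
known_couriers = 100      # order-of-magnitude guess, for illustration only

# A useless model that labels everyone innocent is still 'accurate':
accuracy = (population - known_couriers) / population
print(f"{accuracy:.6%}")  # -> 99.999818%, despite finding zero couriers
```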
In the end, though, they were able to train a model with a false positive rate – the proportion of innocent people wrongly flagged as couriers – of just 0.008%. That’s a pretty good achievement, but given the size of Pakistan’s population it still means about 15,000 people being wrongly classified as couriers. If you were basing a kill list on that, it would be pretty bloody awful.
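The back-of-envelope arithmetic behind that figure, using a rough population number for the period:

```python
false_positive_rate = 0.00008       # 0.008%, from the final slide
pakistan_population = 190_000_000   # rough figure for the period

misclassified = false_positive_rate * pakistan_population
print(round(misclassified))         # -> 15200 people wrongly flagged as couriers
```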
Here’s where The Intercept and Ars Technica really go off the deep end. The last slide of the deck (from June 2012) clearly states that these are preliminary results. The title paraphrases the conclusion to every other research study ever: “We’re on the right track, but much remains to be done.” This was an experiment in courier detection and a work in progress, and yet the two publications not only pretend that it was a deployed system, but also imply that the algorithm was used to generate a kill list for drone strikes. You can’t prove a negative of course, but there’s zero evidence here to substantiate the story.
In reality of course you would combine the results from this kind of analysis with other intelligence, which is exactly what the NSA do – another slide shows that ‘courier machine learning models’ are just one small component of a much larger suite of data analytics used to identify targets, as you’d expect. And of course data analytics will in turn be just one part of a broader intelligence processing effort. Nobody is being killed because of a flaky algorithm. The NSA couldn’t be that stupid and still actually be capable of finding Pakistan on a map.
It’s a shame, because there’s a lot to pick apart in this story, from ethical questions about bulk data gathering and tracking journalists to technical ones. Realistically, how well can you evaluate an algorithm when the original data contains so many people whose classification is unknown? And is ‘courier’ a clear cut category to begin with, or an ever-changing ‘fuzzy’ set?
Finally, it’s a great example of why often the most important thing in artificial intelligence isn’t the fancy algorithms you use but having a really well-defined and well-understood question to start with. It’s only when you fully understand the question that you can truly evaluate the results, as Ars Technica and The Intercept have neatly demonstrated.