New Year’s entity resolutions

Just in case you think that entity resolution problems (matching up names appearing in multiple data sources, while not falsely assuming that everyone named “John Smith” is the same person) are purely an academic concern, I recently got an email from an airline announcing TSA’s new Secure Flight program and asking me to provide them with birth date and gender information when making a reservation:

How Does Secure Flight Make Travel Better for Me?
The TSA determined that making date of birth and gender mandatory data elements would greatly reduce the number of passengers misidentified as a match to the watch list when fully implemented. It is to your advantage to provide the above information to potentially prevent delays or inconveniences at the airport, especially for those individuals who have similar names to those on the watch lists.

Yup, we could have told them that. Matching names correctly is a hard problem. There are lots of false positives. It helps a lot if you have additional context information in a standardized form for each name, so you can disambiguate the names. And eventually you probably need to create some sort of ID system to group the names, aliases, and context info in the database in order to separate the people who have the same name. Oh wait:
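To make the disambiguation point concrete, here is a minimal sketch of how birth date and gender cut down false positives. Everything in it is made up for illustration: the `Person` record, the helper names, and the sample data are assumptions, and this is not a description of how Secure Flight actually matches records.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout, for illustration only.
    name: str
    birth_date: date
    gender: str


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def match_by_name(traveler, watch_list):
    """Return every watch list entry whose normalized name equals the traveler's."""
    return [w for w in watch_list if normalize(w.name) == normalize(traveler.name)]


def match_with_context(traveler, watch_list):
    """Keep only the name matches that also agree on birth date and gender."""
    return [
        w
        for w in match_by_name(traveler, watch_list)
        if w.birth_date == traveler.birth_date and w.gender == traveler.gender
    ]


# Invented sample data: one listed person, three travelers sharing the name.
watch_list = [Person("John Smith", date(1970, 5, 1), "M")]
travelers = [
    Person("John Smith", date(1970, 5, 1), "M"),
    Person("john  smith", date(1985, 11, 23), "M"),
    Person("John Smith", date(1992, 2, 14), "M"),
]

name_hits = [t for t in travelers if match_by_name(t, watch_list)]
context_hits = [t for t in travelers if match_with_context(t, watch_list)]

print(len(name_hits), "flagged on name alone")
print(len(context_hits), "flagged once birth date and gender are compared")
```

In this toy data the extra fields clear two of the three John Smiths. The flip side is that exact comparison on context fields turns a mistyped birth date into a missed match, which is part of why the data has to be standardized.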

The TSA-assigned Redress Number is assigned to Customers who believe they have been mistakenly matched to a name on the watch list to prevent misidentification. Customers who have previously contacted Department of Homeland Security (DHS) and received a redress letter will should soon receive a new letter from DHS that will include the Redress Number for you to enter into your account. The control number is not the same as a Redress Number.

Ok. I guess that is a step forward. At least they are finally admitting that there can be problems with this kind of data-matching system, and have provided a way to get “unmatched”. But it is an interesting “prove you are not a terrorist” precedent. The TSA site also states the program is designed to move the matching process away from the airlines (so they won’t have the no-fly lists) and centralize the matching in a consistent TSA system. This might be a good thing. But one troubling aspect of this, of course, is that even if you are not on a no-fly list, you are still providing TSA with a highly detailed, highly matchable record of all your airline travel. But don’t worry, they have claimed exemption from many of the provisions that normally govern the collection of personal information.

TSA is claiming the following exemptions for certain records within the Secure Flight Records system: 5 U.S.C. 552a(c)(3) and (4); (d)(1), (2), (3), and (4); (e)(1), (2), (3), (4)(G) through (I), (5), and (8); (f), and (g).

According to their response to concerns filed by EFF and EPIC, most of the rules they claim exemption from have to do with a person’s right to know what data has been collected about them so that it can be corrected. TSA asserts that we don’t have that right, because there might be classified information in the database. But wait, if there is incorrect data, or I am falsely matched, wouldn’t that be depriving me of a right to travel? TSA claims that even if the data is wrong, preventing you from flying is not a violation because flying as a particular mode of travel is not a right guaranteed in the Constitution. After all, nobody is stopping you from walking or riding your horse across the country, right?

While I’m sure the claim that the framers of the Constitution did not envision air travel is correct, I think it is probably also true that they did not envision the construction of massive databases capable of tracking and recording the travel history of all citizens, even if they have never been suspected of any crime. My limited skimming of the documents suggests that the data the TSA is collecting is not limited to name, gender and birth date. I think the airlines are required to provide phone number, address, email, and itinerary if they have it; they just are not required to collect it directly from passengers (or to notify us that that information will be provided to the database).

Another thing I find disturbing is that in TSA’s “Myth Busters” section there are several statements like:

Drew Griffin of CNN repeatedly says he is on the watch list but he is not.

I’m sure Drew is relieved to know that he’s not suspected of being a terrorist, but TSA’s response doesn’t refute the claim that Drew is getting searched and hassled on many flights without having done anything wrong, presumably because of a name match with someone in the database who is suspected of something.

If the various security agencies are only now beginning to implement some of the crudest best practices in data management, and are evasive when confronted with errors, why should we trust that they are doing a good job at protecting our civil liberties? I’m sure that there are a few dangerous people in the world who would like to cause great harm, but is this really going to keep us safe from them? Won’t they use a stolen ID anyway? If it is really about security, wouldn’t our resources be better spent investigating the roughly 2,500 people TSA considers too “dangerous” to fly, without building systems that have strong potential for repressive misuse? Even if the actual plans for the system are not “evil”, once that data exists it is more likely to get misused.

How many records are we talking about here? Another TSA page seems to make an indirect assertion that the “watch lists” contain:

  • 2,500 people who are too dangerous to be allowed to fly
  • 17,500 people who should be searched extra carefully and interrogated
  • 400,000 people on the “overall consolidated watch list”
  • 700,000 “records” (aliases, etc?) for those 400,000 people
  • They did some data cleanup in fall 2007; before that there were twice as many records.

Name matching is messy. Even with additional context, matching 100 million+ travelers to 400,000 names is going to have errors. For example, my last name contains a hyphen, and I have no middle name. As far as I can tell, most credit card systems do not process hyphens. So in order to get my name in the reservation systems to match the name on the credit card (often important if you want to pull up your ticket at one of those e-checkin kiosks) I have to leave out the hyphen. But that makes many reservation systems assume that the first part of my last name is actually a middle name, thereby mangling my last name so that it doesn’t match the credit record. Catch-22. Ordinarily this makes me chuckle ’cause I imagine that the programmer who wrote the name parser for the system never thought of my case, and will probably never find out that their code is broken. I just wave to the check-in clerk and ask them to fix the problem. But it is a little more sobering if I imagine that this dirty data is getting pushed to some mysterious database where I have no way to verify it. And if the system can’t deal with a hyphen, what if I said my middle name was “delete from terrorist_threats”? Will the whole system crash? (joke)
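Here is a rough sketch of the hyphen problem, plus the standard defense against the middle-name joke. The parser is my guess at the kind of whitespace-splitting logic that causes the mangling, not code from any real reservation system. The traveler name and the table layouts are invented; only the `terrorist_threats` table name comes from the joke above.

```python
import sqlite3


def naive_parse(full_name):
    """Guess first/middle/last by splitting on whitespace (an assumed, simplistic parser)."""
    parts = full_name.split()
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),  # everything between first and last token
        "last": parts[-1],
    }


ticket_name = "Pat Garcia-Lopez"            # hypothetical traveler
card_name = ticket_name.replace("-", " ")   # a card system that drops hyphens

ticket = naive_parse(ticket_name)
card = naive_parse(card_name)
print(ticket)
print(card)
print("last names agree:", ticket["last"] == card["last"])

# The joke, defused: bound parameters are stored as data, never run as SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terrorist_threats (name TEXT)")
conn.execute("INSERT INTO terrorist_threats VALUES ('placeholder record')")
conn.execute("CREATE TABLE passengers (first TEXT, middle TEXT, last TEXT)")

hostile_middle = "delete from terrorist_threats"
conn.execute(
    "INSERT INTO passengers (first, middle, last) VALUES (?, ?, ?)",
    ("Pat", hostile_middle, "Garcia-Lopez"),
)

threat_rows = conn.execute("SELECT COUNT(*) FROM terrorist_threats").fetchone()[0]
stored_middle = conn.execute("SELECT middle FROM passengers").fetchone()[0]
print("rows left in terrorist_threats:", threat_rows)
print("stored middle name:", stored_middle)
```

Once the hyphen is gone, the parser has no way to tell a two-part surname from a middle name plus a surname, so the two records disagree on the last name. The `?` placeholders are the boring fix for the injection worry: the odd middle name just ends up as an odd string in a column.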

Ok, seriously. There are lots of huge databases with sensitive personal info out there. Probably I shouldn’t trust Google, Facebook, phone companies or credit agencies either; they sell my data in all kinds of evil ways. Heck, I work with large databases of sensitive information about people. Why does this TSA/Counter-terrorism stuff seem so sinister? Perhaps the construction of massive national identity systems would seem less dangerous if we were not also in a period of renewed anti-immigrant fervor, red scares, green scares, etc. I believe that actual security comes from trust, openness, and transparency. Security through obscurity has been proven to fail over and over again. And whenever someone says “trust me” and is sketchy about the details of a very hard problem, I get nervous.

So if you are ever planning on going on vacation, don’t name your kid “Muhammad Atta”, but also don’t name them “John Smith”, ’cause you never know when some other John Smith might do something wrong and get you kicked off a flight. But this does suggest an interesting new form of civil disobedience: suppose I find a government official with the same age and gender as me, and legally change my name to match theirs. If I start making contributions to radical Islamic groups, will they be prohibited from flying?

Update Jan 22:

More interesting information from the Aviation Security Senate Hearing courtesy of the amazing new GovTrack Insider service:

Hamilton said that the real challenge is “understanding, managing and integrating the vast amounts of data” that agencies must sort through. Leitner stated that the NCTC puts more than 350 names on watch lists and looks at 30-40 possible terrorist plots per day. The impressive amount of data that must be sorted and responded to effectively is also hampered by technology, Leitner said in response to several Senators inquiries on intelligence effectiveness.

Leitner, visibly agitated, admitted that Abdullmutallab’s name failed to come up in one query because of a slight misspelling of his name in one database. Furthermore, Leitner attempted to clarify the confusion over databases and lists – including a “secondary-screening” list, a “no-fly” list and why Mr. Abdullmutallab was on a list for further screening once arrived at customs, but not prior to stepping on the plane. “Although Mr. Abdulmutallab was identified as a known or suspected terrorist and entered into the Terrorist Identities Datamart Environment (TIDE) – and this information was in turn widely available throughout the Intelligence Community – the derogatory information associated with him did not meet the existing policy standards – those first adopted in the summer of 2008 and ultimately promulgated in February 2009 – for him to be ‘watchlisted,’ let alone placed on the No Fly List or Selectee lists.” Leitner promised that these policies were being recalibrated.

Well, it happened. One month after I started this post, it appears that a name-matching glitch allowed Umar Farouk Abdul Mutallab to board Northwest Airlines Flight 253. A missing space led to a name match error. Not too surprising: two (very common) name tokens combined into one would certainly be a hard case to catch, even with fuzzy matching. And it seems likely that because they are adding 350 names per day, the DB is too big to do fuzzy name matches? Perhaps examples like this will, ahem, light a fire under our ass to face up to the strengths and limitations of large-scale entity resolution. Maybe instead of trying to build a firewall around ourselves, we should focus on better ways of ending terrorism. Like good foreign policy.
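For the curious, here is a small sketch of why a missing space defeats token-by-token comparison, and what a character-level comparison does with it. The spellings are the ones that appear in this post; which spelling sat in which database is my assumption, and nothing here reflects how the real systems compare names. `SequenceMatcher` is from Python’s standard `difflib` module.

```python
from difflib import SequenceMatcher


def tokens(name):
    """Split a name into lowercase whitespace-separated tokens."""
    return name.lower().split()


def exact_token_match(a, b):
    """True only if both names have the same tokens in the same order."""
    return tokens(a) == tokens(b)


def squash(name):
    """Drop case, spaces, and punctuation so only letters and digits remain."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a, b):
    """Character-level similarity in [0, 1] computed on the squashed names."""
    return SequenceMatcher(None, squash(a), squash(b)).ratio()


listed = "Umar Farouk Abdulmutallab"        # assumed spelling in one database
queried = "Umar Farouk Abdul Mutallab"      # same name with an extra space
misspelled = "Umar Farouk Abdullmutallab"   # one extra letter

print(exact_token_match(listed, queried))      # token lists differ in length
print(squash(listed) == squash(queried))       # identical once spaces are dropped
print(round(similarity(listed, misspelled), 3))
```

Ignoring spaces catches the split token in this toy case, and the similarity score stays high for the one-letter misspelling. The catch is that any threshold loose enough to forgive typos will also pull in more innocent near-namesakes, and scoring every traveler against every record costs far more than an exact lookup.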

One thought on “New Year’s entity resolutions”

  1. Gee, wouldn’t it be simpler to just address and respond to the profound and legitimate concerns that make people so desperate they are willing to give their lives to disrupt oppression?
