Text Fingerprinting Update
Stories and ideas from readers
Journalists watch out—unintentionally revealing sources is less clever and more widespread than first reported.
Two days ago I posted an article on fingerprinting with zero-width and other characters. It garnered a number of stories and comments from readers, so I decided to post this quick update.
I used to work for members of the US Senate and House of Representatives, and when it came to unreleased legislative text there was a more primitive method of doing this if you were worried about keeping it under wraps. Intentional typos and small changes to the text that most people wouldn’t notice.
That way, if the text got to a lobbyist or to the press, you could tell which office did the leaking, because every office was given a slightly different text with different fingerprints.
- Jim Swift of The Weekly Standard.
Misspelling is quite a bit easier than relying on synonyms, but let’s evaluate my proposed countermeasures against this method:
- Avoid releasing excerpts and raw documents. Works perfectly.
- Get the same documents from multiple leakers to ensure they have the exact same content on a byte-by-byte level. Works perfectly.
- Manually retype excerpts to avoid invisible characters and homoglyphs. Works unless careless.
- Keep excerpts short to limit the amount of information shared. Works unless unlucky.
- Use a tool that strips non-whitelisted characters from text before sharing it with others. Doesn’t work.
What I like about this story is that I hadn’t even thought of this fingerprinting vector, but most of my proposed countermeasures work and they work in the order that I ventured would be most to least secure. Swift’s method is a classic canary trap, a method employed by spies and mapmakers since time immemorial.
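To make the second countermeasure concrete, here is a minimal sketch of how two copies of the same document might be compared. It is my own illustration, not anything Swift's offices or any newsroom actually uses, and the function names and sample sentences are invented. It checks byte-level equality first, then lists the words where the copies disagree, which is where a typo-based canary would show up.

```python
import difflib


def identical_bytes(copy_a: str, copy_b: str) -> bool:
    """True only if both copies encode to exactly the same UTF-8 bytes."""
    return copy_a.encode("utf-8") == copy_b.encode("utf-8")


def differing_words(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Return (text_in_a, text_in_b) pairs wherever the two copies disagree."""
    words_a = copy_a.split()
    words_b = copy_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return diffs


# Invented sample text: two copies that differ by a typo and a number format.
office_a = "The committee shall recieve the report within 30 days."
office_b = "The committee shall receive the report within thirty days."

print(identical_bytes(office_a, office_b))  # False
print(differing_words(office_a, office_b))
# [('recieve', 'receive'), ('30', 'thirty')]
```

Any difference at all means at least one copy may be marked, so the safe reading of a non-empty result is "do not publish either copy verbatim."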
So I saw this post and tried to create a script to combat the issue. I’m sure there’s an easier, 1 line, awk/sed solution, but I figured this could be a bit more robust and have more features. I’m open to taking suggestions/additions if anyone has any.
https://github.com/DavidJacobson/SafeText
I want to work on it until it has a sort of “profiling” feature, where it tries to guess the author’s background based off language + characters, and more (need to read more first.)
- David Jacobson
While I would caution people against relying on these types of scripts as their sole protective measure against text fingerprinting, realistically journalists need to email leaks around to vet their sources, and an open source project that sanitizes what it can is better than nothing.
It’s a bit like journalists who use Signal to communicate. It isn’t NSA-proof, but it’s a whole lot better than nothing. Your local blackhat for hire can’t crack it.
Of course I can’t personally vouch for Jacobson’s code or character, but most programmers are good, so he can probably be trusted.
A few readers contacted me to point out other characters that can be used for this kind of fingerprinting, including the en space, figure space, and others. Some reported seeing these used in the wild, though they declined to share their names publicly.
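For readers who want to check a document for characters like these, here is a rough sketch of a whitelist-based scanner. It is not Jacobson's code, and the whitelist (printable ASCII plus newline and tab) is an assumption I picked for illustration. A whitelist that strict will also flag legitimate characters such as curly quotes and accented letters, so treat the output as a list of things to look at rather than proof of fingerprinting.

```python
import unicodedata

# Assumed whitelist for illustration: printable ASCII plus newline and tab.
ALLOWED = set(map(chr, range(0x20, 0x7F))) | {"\n", "\t"}


def flag_suspicious(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each non-whitelisted character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch not in ALLOWED
    ]


def strip_suspicious(text: str) -> str:
    """Replace exotic space characters with a plain space and drop everything else
    that is not on the whitelist."""
    out = []
    for ch in text:
        if ch in ALLOWED:
            out.append(ch)
        elif unicodedata.category(ch) == "Zs":  # space separators such as en space
            out.append(" ")
        # Anything else (zero-width characters, homoglyphs, ...) is dropped.
    return "".join(out)


# Invented sample: an en space, a figure space, and a trailing zero-width space.
sample = "top\u2002secret\u2007memo\u200b"

for hit in flag_suspicious(sample):
    print(hit)
# (3, 'U+2002', 'EN SPACE')
# (10, 'U+2007', 'FIGURE SPACE')
# (15, 'U+200B', 'ZERO WIDTH SPACE')

print(strip_suspicious(sample))  # top secret memo
```

As with any character filter, this does nothing against typos or changed wording, which is why it sits at the bottom of the countermeasure list.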
If you’ve found instances of fingerprinting in documents sent to your news organization and would like to include my comment in an article on the topic, please feel free to reach out.