You’d like to slip some data past some prying eyes. For whatever reason, overtly encrypting the data isn’t possible. It’s not just that the data can’t be seen, but you can’t have anyone be aware data is even moving anywhere.
It’s not unheard of to use metadata for smuggling data. That way, the ostensible file- a Word document, a JPEG file looks innocent- it can even be opened up and read -but if you know where to look, there’s the secret data.
What’s data, in contrast to metadata? That’s a matter of intention. It’s the surface purpose of the file format.
But not all metadata is created equal.
The data itself can be crafted or written to convey a second meaning. For example, subtext in fiction, or taking a photo of 6 sunflowers.
You can use patterns- spaces or number of syllables in text, or colours of pixels in images.
You can encode the data in such a way that it includes extra meaning- using homoglyphs, or diacritic combining characters.
But we’re talking about metadata, I thought?
All metadata has a purpose, and is going to affect the way the file is read or displayed. Different applications may expose that metadata more visibly than others. So we have to be careful not to store our payload in what turns out to be plain sight. An example being the document title, which is often going to be displayed in the application title bar. So it might look a little suspicious in the document about cakes says “The X-94 Prototype uses Jumbillium alloy”.
One option can be to find formats where we can define our own metadata, or look for the equivalent of junk DNA.
What’s junk DNA?
Noncoding DNA or Junk DNA is DNA that doesn’t do anything (sidenote- much of it probably does do things after all). It just bulks out the genome to no effect (again, probably not true). But it sits right in there with the other DNA. Where most DNA shows the body how to make a protein, ncDNA does not.
Many file formats will have an end-of-file token, and ignore anything after that. They may have a payload length in the header, and again, everything after that is ignored.
Error correction codes have been misused in the past on audio CDs, but we could use the ECC bits to store our data- quite a percentage of the audio CD is that.
But what about metametadata?
Metametadata is metadata about metadata. It’s casting metadata as the main role, then thinking about what we can say about it.
- GPS co-ordinates (which would be the metadata) of a location which can be looked up on a map, with a name, say, London, encoding for the letter L, or number 6. Or 50, I suppose.
- Or GPS co-ordinates where the least significant digits carry the hidden data.
- In databases, using wide, non-sequential ID columns to encode our data. The GUID type in Microsoft Access and SQL Server is 16 bytes wide! Feed our data through a 128 bit block cipher, and we’ll have pretty random-looking data, so given unique input plaintext, we are going to get unique cyphertext out. Prepending a sequence number to our text might be needed if we know we’ll be wanting to send repeating messages, but that eats into our budget. We can expose the GUIDs in some interface, perhaps in what would look like debug messages/comments in some HTML front end.
- TTL fields in IP headers. Now. These won’t survive end-to-end unaltered. But, over time, we can establish a baseline number of hops, interleaved with TTLs to carry our covert meaning. Sure, it’s not a lot of data, but with the right application, we can send a lot of packets in any session, or spread out over time- perhaps looking at other data leaving that network node or time of day to ensure it doesn’t look out of place. With the baseline TTL known (have to recheck periodically in case of path changes), we can monitor these transmissions from any node in the path.
- If we’re playing around at the packet level, we can also bury stuff in the sequence number of TCP headers, although this is going to require more co-operation at the other end, a network stack that can ignore sequence numbers on a certain port, perhaps, and simply funnels the sequence numbers to a decoding utility.
- Both of the above techniques could also be employed as a network “knock”, the idea of refusing connections to a port until a certain series of events happens, like a secret knock.
Many file formats have a lot, or total flexibility with regards to the ordering of the metadata, so we can use that to encode data.
Problems with embedding data in metadata is that if it gets too big, the filesize is going to look incongruous with the overt data. It would be worth looking at a sample of JPEGs, etc, to determine the average, and have a think about what heuristics might be engaged to detect meta (or metameta) data payloads. We might scrub the EXIF clean of a JPEG, or at least flag it up as likely to be suspect, not just size, but use of unusual, or rarely used EXIF tags.