MS Office Document Data-Hiding/Exfiltration
I was toying around with an MS Word document for malware analysis and found a cool vector for exfiltrating data. Newer Office documents (.docx, .xlsx, .pptx) are constructed as a collection of XML files inside a ZIP-compressed directory structure. Observe the “PK” header, which is the ZIP file signature:
hexdump of a .docx file
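If you would rather check the signature from the prompt than from a hex editor, here is a minimal sketch. The function name Test-ZipSignature is my own (hypothetical), and it only looks at the first four bytes, so it says "this starts like a ZIP", not "this is a valid Office document".

```powershell
# Hypothetical helper: returns $true when the file starts with the ZIP
# local-file-header signature 'P' 'K' 0x03 0x04.
function Test-ZipSignature {
    param([Parameter(Mandatory)][string]$Path)

    # Resolve to a full path because .NET does not follow PowerShell's $PWD
    $FullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $Stream = [System.IO.File]::OpenRead($FullPath)
    try {
        $Header = New-Object byte[] 4
        $Read = $Stream.Read($Header, 0, 4)
    } finally {
        $Stream.Dispose()
    }

    return ($Read -eq 4 -and
            $Header[0] -eq 0x50 -and   # 'P'
            $Header[1] -eq 0x4B -and   # 'K'
            $Header[2] -eq 0x03 -and
            $Header[3] -eq 0x04)
}
```

Usage would be something like: Test-ZipSignature .\testdocx.docx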
If you change the file extension from .docx to .zip, you can unpack the file structure. In PowerShell, the built-in cmdlet “Expand-Archive” only works after that rename, because it checks the file extension rather than the file contents. You can skip the rename by calling the [System.IO.Compression.ZipFile]::ExtractToDirectory() method instead, which does not care about the extension. Below is an example using Expand-Archive. Notice the first attempt fails because of the extension check (even though it IS a PK file), and the second fails only because I typed the old file name after renaming it.
PS D:\powershell\doctest> Expand-Archive .\testdocx.docx
Expand-Archive : .docx is not a supported archive file format. .zip is the only supported archive file format.
At line:1 char:1
+ Expand-Archive .\testdocx.docx
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (.docx:String) [Expand-Archive], IOException
+ FullyQualifiedErrorId : NotSupportedArchiveFileExtension,Expand-Archive
PS D:\powershell\doctest> mv .\testdocx.docx .\testdocx.zip
PS D:\powershell\doctest> Expand-Archive .\testdocx.docx
Expand-Archive : The path '.\testdocx.docx' either does not exist or is not a valid file system path.
At line:1 char:1
+ Expand-Archive .\testdocx.docx
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (.\testdocx.docx:String) [Expand-Archive], InvalidOperationException
+ FullyQualifiedErrorId : ArchiveCmdletPathNotFound,Expand-Archive
PS D:\powershell\doctest> Expand-Archive .\testdocx.zip
PS D:\powershell\doctest> ls
Directory: D:\powershell\doctest
Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 10/23/2020 12:06 PM testdocx
-a---- 10/23/2020 12:05 PM 13100 testdocx.zip
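The ZipFile route mentioned earlier avoids the rename entirely. A minimal sketch, assuming Windows PowerShell 5.1 or later; the wrapper name Expand-Docx is hypothetical, while ExtractToDirectory is the real .NET method. Note that it throws if the destination already contains the files.

```powershell
# Load the assembly that provides [System.IO.Compression.ZipFile]
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical wrapper: unpack a .docx without renaming it to .zip first.
function Expand-Docx {
    param(
        [Parameter(Mandatory)][string]$DocxPath,
        [Parameter(Mandatory)][string]$DestinationDir
    )

    # .NET resolves relative paths against the process directory, not $PWD,
    # so convert both paths to absolute file system paths first.
    $Source = (Resolve-Path -LiteralPath $DocxPath).ProviderPath
    $Dest = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($DestinationDir)

    # The extension is never inspected; only the ZIP structure matters.
    [System.IO.Compression.ZipFile]::ExtractToDirectory($Source, $Dest)
}
```

Usage would be something like: Expand-Docx .\testdocx.docx .\testdocx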
To create the test document, I automated MS Word to type the text “Only 16 Bytes :)”:
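# Create an empty placeholder file for Word to open
$DestFile = New-Item $(Join-Path $PWD -ChildPath "testdocx.docx")
# Start Word through COM and open the placeholder
$WordObject = New-Object -ComObject word.application
$Null = $WordObject.Documents.Open($DestFile.FullName)
$WordObject.Visible = $False
# Type the 16-character string followed by a paragraph break
$TextInput = $WordObject.Selection
$Null = $TextInput.TypeText("Only 16 Bytes :)")
$Null = $TextInput.TypeParagraph()
# Save and close all open documents, then exit Word
$WordObject.Documents.Save()
$WordObject.Documents.Close($false)
$WordObject.Quit()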
Listing the directory confirms the file was created:
PS D:\powershell\doctest> ls
Directory: D:\powershell\doctest
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 10/23/2020 12:05 PM 13100 testdocx.docx
Notice that the file is 13,100 bytes even though only 16 bytes of text were put in the document. The difference is due to the XML structure that makes up the document.
Next, the document is unpacked as previously shown. Looking into the directory structure, there are a bunch of XML files that winword.exe uses to represent the file in the UI when opening the document. Since this is a small document without much formatting, the files are limited. If it had a lot of textual effects, there would be more XML files. Which leads to the next (fun) part.
Here I take PsExec, read it into a byte array, then convert it to a Base64 string. You can XOR the string, zip it, XOR it again, etc. to obfuscate it if you want, but for this simple case I’ll just use the raw string, dump it to an XML file named “item2.xml”, and copy it into the directory structure under customXml. Since there’s already an item1.xml, this wouldn’t look suspicious at a glance. To take it a step further, you could format it to resemble a real XML document.
PS D:\powershell\doctest> $FileBytes = [System.IO.File]::ReadAllBytes($(ls .\PsExec.exe).FullName)
PS D:\powershell\doctest> $Base64 = [convert]::ToBase64String($FileBytes)
PS D:\powershell\doctest> echo $Base64 > item2.xml
PS D:\powershell\doctest> cp .\item2.xml .\testdocx\customXml
PS D:\powershell\doctest> ls .\testdocx\customXml
Directory: D:\powershell\doctest\testdocx\customXml
Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 10/23/2020 12:29 PM _rels
-a---- 1/1/1980 12:00 AM 241 item1.xml
-a---- 10/23/2020 12:46 PM 904262 item2.xml
-a---- 1/1/1980 12:00 AM 341 itemProps1.xml
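One detail worth knowing about the step above: in Windows PowerShell 5.1 the > redirect writes UTF-16 by default, so the Base64 text takes roughly two bytes per character on disk. Below is a sketch of writing it as single-byte text instead, plus the reverse step for getting the binary back out on the other end. Both function names are hypothetical, not anything from the original session.

```powershell
# Hypothetical helper: Base64-encode a file and write it as single-byte text.
function ConvertTo-Base64File {
    param(
        [Parameter(Mandatory)][string]$InputFile,
        [Parameter(Mandatory)][string]$OutputFile
    )
    $Source = (Resolve-Path -LiteralPath $InputFile).ProviderPath
    $Bytes = [System.IO.File]::ReadAllBytes($Source)
    # -Encoding Ascii stores one byte per Base64 character
    Set-Content -LiteralPath $OutputFile -Value ([Convert]::ToBase64String($Bytes)) -Encoding Ascii
}

# Hypothetical helper: read the Base64 text back and restore the original bytes.
function ConvertFrom-Base64File {
    param(
        [Parameter(Mandatory)][string]$InputFile,
        [Parameter(Mandatory)][string]$OutputFile
    )
    # Get-Content honors a byte-order mark, so UTF-16 input also decodes;
    # Trim() drops the trailing newline written by Set-Content or the redirect.
    $Text = (Get-Content -LiteralPath $InputFile -Raw).Trim()
    $Dest = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutputFile)
    [System.IO.File]::WriteAllBytes($Dest, [Convert]::FromBase64String($Text))
}
```

Usage would be something like: ConvertTo-Base64File .\PsExec.exe .\item2.xml, and later ConvertFrom-Base64File .\item2.xml .\recovered.exe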
There, all that’s left to do is pack it back up and ship it out (or in).
For this step, you have to be mindful to pack only the top-level files and directories, so that [Content_Types].xml sits at the root of the archive. Don’t copy everything into one parent directory and then compress (zip) that directory. That’s not the layout Word expects, and it will error out when attempting to open the file.
I ran into an issue attempting to do this from the command line, I’m guessing due to the wrong compression type. That’s something I’ll circle back to when I get the chance, so instead I did it manually from Explorer.
In Explorer, select all of the top-level files and folders, then Right-Click -> Send To -> Compressed (zipped) folder.
Rename the file from .zip to .docx and doneski. Notice the file size difference (277,826 bytes versus the original 13,100), but the contents viewable inside the doc remain the same.
PS D:\powershell\doctest> ls .\exfil.docx
Directory: D:\powershell\doctest
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 10/23/2020 1:22 PM 277826 exfil.docx
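For the repack step, here is a sketch of a command-line alternative to Explorer. I have not confirmed what broke the original command-line attempt. One thing worth ruling out is entry names written with backslashes, which some older Windows zip tooling has produced, while the package format expects forward slashes. This version adds each file explicitly with a forward-slash name relative to the unpacked root. The function name New-DocxFromDirectory is hypothetical, and I have only reasoned about the archive layout here, not tested the result in every Word version.

```powershell
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: zips the CONTENTS of $SourceDir (not the folder itself)
# so that [Content_Types].xml ends up at the archive root.
function New-DocxFromDirectory {
    param(
        [Parameter(Mandatory)][string]$SourceDir,
        [Parameter(Mandatory)][string]$DestFile
    )
    $Root = (Resolve-Path -LiteralPath $SourceDir).ProviderPath.TrimEnd([char[]]'\/')
    $Dest = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($DestFile)

    # Create mode throws if $Dest already exists
    $Zip = [System.IO.Compression.ZipFile]::Open($Dest, [System.IO.Compression.ZipArchiveMode]::Create)
    try {
        foreach ($File in Get-ChildItem -LiteralPath $Root -Recurse -File -Force) {
            # Path relative to the root, always with forward slashes
            $EntryName = $File.FullName.Substring($Root.Length + 1).Replace('\', '/')
            $Null = [System.IO.Compression.ZipFileExtensions]::CreateEntryFromFile($Zip, $File.FullName, $EntryName)
        }
    } finally {
        $Zip.Dispose()
    }
}
```

Usage would be something like: New-DocxFromDirectory .\testdocx .\exfil.docx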
After a cursory search, I’m not sure whether this is being done in the wild. But it could be a daunting and costly vector to detect. I’m not up to speed on modern detection mechanisms dedicated to inspecting .docx structures, but I would venture a guess that this isn’t caught by a lot of things. I’ve gotten this to work through Google with no issue.
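As a rough idea of what a cheap check could look like, the sketch below walks every .xml and .rels part in a document and reports its size and whether it parses as XML. A raw Base64 blob like the item2.xml above fails the parse, but the check is trivially defeated by wrapping the payload in a real XML element, so treat it as an illustration and not a detection rule. The function name Test-DocxXmlPart is hypothetical.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: report each XML part's size and well-formedness.
function Test-DocxXmlPart {
    param([Parameter(Mandatory)][string]$DocxPath)

    $Source = (Resolve-Path -LiteralPath $DocxPath).ProviderPath
    $Zip = [System.IO.Compression.ZipFile]::OpenRead($Source)
    try {
        foreach ($Entry in $Zip.Entries) {
            if ($Entry.FullName -notmatch '\.(xml|rels)$') { continue }

            # StreamReader detects a byte-order mark, so UTF-16 parts are read correctly
            $Reader = New-Object System.IO.StreamReader($Entry.Open())
            try { $Text = $Reader.ReadToEnd() } finally { $Reader.Dispose() }

            # The [xml] cast throws when the text is not well-formed XML
            $WellFormed = $true
            try { $Null = [xml]$Text } catch { $WellFormed = $false }

            [pscustomobject]@{
                Name       = $Entry.FullName
                Length     = $Entry.Length
                WellFormed = $WellFormed
            }
        }
    } finally {
        $Zip.Dispose()
    }
}
```

Usage would be something like: Test-DocxXmlPart .\exfil.docx | Where-Object { -not $_.WellFormed }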