Code Buckets

Sorting Unknown Images With PowerShell

The Problem

You’ve got a large number of binary files. Some are images, but you’ve no idea which ones. Some will be gifs, some jpegs, some bmps and some stranger formats. They’ve been dropped on you without file extensions. Perhaps they’ve been extracted from a database blob field. Perhaps they have been partially retrieved from some backup tapes after a system crash. Perhaps they have been unearthed in an Anglo-Saxon burial mound just outside of Norfolk. However they arrived, it is now your task to sort them by file type.

The Solution

The broad principle here is that image types are identifiable from their first few bytes. So, in hex:

  • jpg starts with “FFD8”
  • gif starts with “474946”
  • bmp starts with “424D”
  • png starts with “89504E470D0A1A0A”
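
The principle can be sketched as a small lookup function. This is only an illustration: the function name Get-ImageTypeFromHeader and its parameter are hypothetical. It takes a hex string of the leading bytes and returns the matching type name, or $null when no signature matches.

```powershell
# Hypothetical helper: map a hex string of leading bytes to an image type.
function Get-ImageTypeFromHeader {
    param([string]$HexHeader)

    # Signatures from the list above, as upper-case hex prefixes
    $signatures = [ordered]@{
        png = "89504E470D0A1A0A"
        gif = "474946"
        bmp = "424D"
        jpg = "FFD8"
    }

    foreach ($type in $signatures.Keys) {
        if ($HexHeader.ToUpper().StartsWith($signatures[$type])) {
            return $type
        }
    }
    return $null   # no known signature matched
}

Get-ImageTypeFromHeader "474946383961"   # a GIF89a header, returns gif
```

Because none of these four signatures is a prefix of another, the order of the checks does not change the result here.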

We could use any programming language we choose to sort the images on this basis. I’m going to use PowerShell because

  1. This seems to me like a DevOps type of activity. PowerShell scripts are easy to hook into a continuous build and the like
  2. I don’t need to write any kind of UI
  3. I’m practising my PowerShell and trying to get better (the real reason)

The Script

This is the entire script with explanatory comments:

param
(
 [string]$FilePath = "C:\Users\tbrown\Pictures"
)

# dictionary with the image identifiers
$images = @{"jpg" = "FFD8"; gif = "474946"; "bmp" = "424D"; png = "89504E470D0A1A0A"}

# Get all files (not directories) under a given path
Get-ChildItem $FilePath | ? { !$_.PSIsContainer } | % {

 $ImageFilePath = $_.FullName
 $FileHeader = ""

 # Get the first 8 bytes of the file as a hex string.
 # Note: -Encoding Byte works in Windows PowerShell 5.1; PowerShell 6+ uses -AsByteStream instead
 Get-Content $ImageFilePath -First 8 -Encoding Byte | % {
  $FileHeader = $FileHeader + $_.ToString("X2")
 }

 # test each image type in the dictionary
 $images.GetEnumerator() | % {

  if($FileHeader.StartsWith($_.Value))
  {
   # we identified the file type. Create a directory if needed and move
   Write-Host($ImageFilePath + " is a " + $_.Key)
   $ImageDirectory = Join-Path $FilePath $_.Key
   New-Item -ItemType Directory -Force -Path $ImageDirectory | Out-Null
   Move-Item $ImageFilePath $ImageDirectory
  }
 }
}

The core part of the script is

Get-Content $ImageFilePath -First 8 -Encoding Byte | % {
 $FileHeader = $FileHeader + $_.ToString("X2")
}

Get-Content reads the file, in this case only its first 8 bytes. Each byte is then converted to a two-digit hexadecimal string with ToString("X2") and appended to the $FileHeader variable, which we compare against the known image headers to identify the image type. The rest of the script moves the files around and sorts them into different directories.
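
As a worked example of the conversion, the three bytes 0x47, 0x49, 0x46 become "47", "49" and "46", which join into "474946", the gif signature. The sketch below does the same thing for a file, but reads the bytes through .NET's System.IO.File rather than Get-Content, since -Encoding Byte is only accepted by Windows PowerShell and not by PowerShell 6+. The function name Get-FileHeaderHex and its parameters are hypothetical, and it assumes a full path is passed in (such as $_.FullName), because .NET does not resolve relative paths against PowerShell's current location.

```powershell
# Hypothetical helper: return the first $Count bytes of a file as a hex string.
function Get-FileHeaderHex {
    param(
        [string]$Path,      # assumed to be a full path
        [int]$Count = 8
    )

    $buffer = New-Object byte[] $Count
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # Read returns how many bytes were actually read (fewer for short files)
        $read = $stream.Read($buffer, 0, $Count)
    }
    finally {
        $stream.Dispose()
    }

    # Format each byte as two upper-case hex digits and append it
    $hex = ""
    for ($i = 0; $i -lt $read; $i++) {
        $hex += $buffer[$i].ToString("X2")
    }
    return $hex
}
```

Only the requested bytes are read from disk, so this stays cheap even when the files are large.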

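If the script is saved into a file, e.g. ImageSorter.ps1, it can then be run from its own directory by calling it with a relative path (this is a plain invocation rather than dot sourcing, which would be written . .\ImageSorter.ps1).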

.\ImageSorter.ps1 -FilePath "C:\MyFilePath"

So that’s it, image sorting in a nutshell. Hopefully the above script will be useful for someone, somewhere at some point. Happy unknown image sorting everyone.

Useful Links

File signatures

https://en.wikipedia.org/wiki/List_of_file_signatures
A useful, easy-to-read list of file signatures which could be used in identifying unknown files. Most signatures sit at the start of the file, but some sit at an offset, which the list gives. The above script could easily be extended to account for alternative types using this list.
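
One way such an extension might look is a signature table where each entry carries its own offset. This is a sketch under assumptions, not part of the original script: the table layout and the function name Find-ImageType are hypothetical, and the tiff and webp entries are added purely as examples. Note that the webp check needs the first 12 bytes of the file, whereas the script above only reads 8.

```powershell
# Hypothetical signature table: type name, byte offset, and the hex expected there.
$signatures = @(
    @{ Type = "jpg";  Offset = 0; Hex = "FFD8" }
    @{ Type = "gif";  Offset = 0; Hex = "474946" }
    @{ Type = "bmp";  Offset = 0; Hex = "424D" }
    @{ Type = "png";  Offset = 0; Hex = "89504E470D0A1A0A" }
    @{ Type = "tiff"; Offset = 0; Hex = "49492A00" }   # little-endian TIFF
    @{ Type = "tiff"; Offset = 0; Hex = "4D4D002A" }   # big-endian TIFF
    @{ Type = "webp"; Offset = 8; Hex = "57454250" }   # "WEBP" after the 8-byte RIFF header
)

# Hypothetical helper: find the first signature matching a hex header string.
function Find-ImageType {
    param([string]$HexHeader)   # hex string of the leading bytes

    foreach ($sig in $signatures) {
        $start = $sig.Offset * 2      # two hex characters per byte
        $length = $sig.Hex.Length
        if ($HexHeader.Length -ge $start + $length -and
            $HexHeader.Substring($start, $length) -eq $sig.Hex) {
            return $sig.Type
        }
    }
    return $null   # nothing matched
}

Find-ImageType "524946462400000057454250"   # RIFF container holding WEBP
```

Keeping the offset in the table means new formats can be added as data, without touching the matching logic.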

http://www.fileformat.info/ gives a wealth of information on file structure and formats. For the (very) interested, here are detailed breakdowns of bmp, jpeg, gif and png files, including header information.

http://www.fileformat.info/format/bmp/corion.htm
http://www.fileformat.info/format/jpeg/egff.htm
http://www.fileformat.info/format/gif/egff.htm
http://www.fileformat.info/format/png/egff.htm

Other file types that I didn’t implement, such as tiffs, are also detailed.

Alternative implementations

This Stack Overflow answer is a C# implementation of an image sorter should anyone require it. I did pinch the file header information from here (easier than fileformat.info) but the rest is my very own work crafted by my very own coding fingers – promise.
