Format Identification#
subtitle: What’s in the box?
Understanding your digital collections.
I think this is really an experiment?
Ideas#
ML and confidence and my prior work on ID based on punctuation plus co-location? plus whole-words-or-tokens note that char-type-coloc could be shared publicly as no actual data remains and UI and a standard for format data, like a BagIt thing