Is digitisation on topic?
There have already been a couple of questions focused on digitisation
rather than digital preservation. I would assume that these are actually
off topic and should be closed? The boundary is of course a little
fuzzy.
I would propose that this community should be involved in decisions
related to the results of a digitisation initiative. For example, the
file formats and metadata schemas used, where/how the results are
stored, and so on. However, questions focused on how to carry out
digitisation, such as what types of scanners to use or what resolution
to digitise at, would be off topic.
This question is about OCR, although it is phrased in the title as a more
general digitisation question. Is this in scope? It's a tough call.
As many of us in the digitisation community are well aware, confusing
digital preservation with digitisation is a common mistake. I'd suggest
adding some clear scoping detail on this to the "What kind of questions
should I not ask here?" section of the FAQ. Examples may well be
necessary to keep things clear.
Paul Wheatley
Comments
- Andy Jackson: Late to the party, but just wanted to say that I think we should not be
surprised at the confusion. That Digital Preservation means Preserving
Things Digital rather than Preserving Things Digitally is pretty
confusing, and we should expect to help newcomers disambiguate the two.
- Paul Wheatley: We could definitely use a migration path to Libraries and Information
Science Stack Exchange, so any questions closed due to scope can be
pushed in that direction.
Answer by Donald.McLean
(updated to reflect a more careful consideration of the subject)
It occurred to me that the answer should be a conditional yes, rather
than an unqualified yes.
If someone is just interested in digitizing audio, video, or
photographs, they should probably be asking in Audio and Video
Production or Photography.
If someone is asking about digitizing as part of a digital preservation
project or with the intention of digitally preserving their material,
then I think that it should definitely be on-topic.
It seems to me that if someone wants to digitally preserve something,
and they have a legitimate question about the digitization process,
especially if it relates to the preservation aspect, then we shouldn't
be turning them away. They're a legitimate part of the constituency of
this site and we should be taking their questions seriously.
Comments
- Paul Wheatley: I would argue that digitisation is not digital preservation, and both
Wikipedia and the experts agree with this assertion. Hence digitisation
is out of scope. Or the title of this site is wrong.
- Donald.McLean: I didn't argue that digitization was digital preservation. I argue that
it is a closely related topic. Fanatical adherence to a very narrow
topic set isn't healthy for the site.
- Paul Wheatley: I don't think this is fanatical adherence. We're talking about two
related but very different topics. Confusing the two would not be a
positive move for a best practice site. If it was called "Digital
Preservation and Digitisation" I would not have an issue.
- Donald.McLean: Without digitization, there wouldn't be any digital content to preserve,
so I hardly think that ruling it as off topic is either wise or
productive.
- Paul Wheatley: I strongly disagree with that statement. Much of the digital information
created today is *born* digital. If digitisation is done correctly,
its digital preservation should be trivial. The main digital
preservation challenge lies with born digital data. This is why it's very
important to avoid confusion with digitisation. It's a different process.
- Donald.McLean: Born digital content may be the norm for some companies and some
industries, but there's a big wide world out there, full of folks with
different issues and different situations. This site is for all of those
people, not just those wrestling with problems inherent to born digital
data. I still haven't seen a single argument that would convince me why
such a closely related topic should be excluded.
- Paul Wheatley: :-) I haven't seen an argument that would convince me why every
digitisation question should be included in a digital preservation Q+A
site. I'll bow out of this one for now and see what other people think,
as this should be a community decision.
- Donald.McLean: I agree - it is a community decision. I'm just trying to point out what
I feel are important points about how questions like this will affect
the long-term health of the site. A certain amount of tolerance and
inclusionism will help build a strong community where blind adherence to
some arbitrary guidelines won't.
- John Lovejoy: I am a firm believer that digital preservation DOES NOT involve
digitisation. Digitisation is a process to create digital artifacts,
sure, but the long term preservation of those artifacts is a totally
different area, with a totally different skillset required.
- Paul Wheatley: A further thought I had on this: Digitisation remains in scope on the
Libraries and Information Science SE, where many from the digital
preservation community have also been asking questions about digital
preservation in lieu of this SE going live.
Answer by Courtney C. Mumma
I agree that the mechanics of digitization are not in scope for this
site, but think that the preservation of the results is well within
scope, and that includes the formats that one creates in the course of
digitization projects since those formats may determine the course of
preservation activities.
Comments
Answer by warren
If you're not digitizing, how can you be doing digital preservation?
Seems like it has to be on-topic.
Comments
- Donald.McLean: My thoughts as well, but said more succinctly.
- walker: To be clear, I believe by 'digitisation' Paul means the process of
creating a digital representation of an analog object. Therefore one
could easily have 'digital preservation' - __ensuring access to a
digital object over time__ - without digitization, just by working
with **born-digital objects**: tweets, images from digital cameras,
Word documents, web pages, Pixar movie components, etc. Nevertheless I
personally think digitization would be a candidate topic here.
- Adam-E: I agree, though one might argue that the content could have been born
digital, in which case the question is how to preserve it rather than
preservation by digitising.
- Adam-E: Ah, looks like @walker beat me to the punch on my last comment :)
- warren: @Adam-E - given that not all artifacts are "born digital" ... it still
seems that digitization must be on-topic
- gmcgath: If digitizing is automatically on topic solely because it's a
prerequisite to digital preservation (although it isn't, as has already
been pointed out a couple of times), then wouldn't everything from
booting up a computer onward have to be considered "on topic"? I agree
that digitization is on topic when it comes up in a preservation
context, but not unconditionally.
Answer by wizzard0
I think digitization of analog objects with the goal of digitally
preserving them, or keeping them easily preservable in the future, is on
topic.
Digitization by itself, however, may better be discussed elsewhere.
Comments
Answer by Nicholas Webb
As Donald and Courtney have pointed out, there's a continuum of
relevance here. "What's the best scanner to use" and "what DPI should I
scan at" seem obviously out of scope, "should I save as TIFF or JPG"
less so, "how should I organize and describe the files once I've scanned
them" clearly on-topic.
My experience on archives listservs and discussion groups is that basic
scanning guidelines are a perennial topic of inquiry from small LAM
institutions trying to bootstrap a digitization program. Perhaps the
answer is to link to some good introductory digital imaging resources
from the FAQ, with a newbie-friendly explanation of why these types of
questions are out of scope.
A distinction that might be useful: is the question about creating an
authentic representation of an analog object ("will this file format
capture all the details of my book/manuscript/photograph?"), or is it
about ensuring the sustainability of the resulting digital object ("will
my institution be able to read this file format in ten years?"). This
might be a workable guideline for determining relevance.
Comments
- Paul Wheatley: I like this approach. I don't want to avoid helping people out with
"which scanner, what DPI" type questions, but I do want to avoid having
the site swamped with low tech digitisation issues, which is likely to
put off a lot of digital preservation experts, and dilute the quality of
this forum. Having a bit of detail and some good references in the FAQ
could, as you suggest, be a win-win.
- Donald.McLean: It has been my experience that many first time posters do not read the
FAQ before asking questions. The important thing to remember about this
is that a snarky, elitist comment about how the question is clearly off
topic and they should have read the FAQ first WILL NOT HELP. As long as
everyone is clear on the fact that even people whose only shortcoming is
failing to read the FAQ first should still be treated in a friendly and
welcoming manner, I think this is a policy I can get behind.
- Nicholas Webb: Cosign Donald's comment. In my experience consulting with local
historical societies, many longtime archivists/curators of analog
material have a vague idea that digitization and digital preservation
are things they need to be doing, but if they aren't technologists or at
least power users themselves, getting them up to speed can require a lot
of handholding on basic technical topics. This stack isn't the place for
those questions, but given that it's likely to get them, we should
encourage a culture of politeness/friendliness in referring them
elsewhere.
Answer by Ross Spencer
Yes.
The phrasing is key, however, and it should be focused on the results of
digitization (the formats, the metadata associated with it, the storage,
etc.) and, beyond that, errors that appear in the data stream.
I would be interested in seeing questions about the digitization
process. Sean Martin's work at the BL, for example, comes to mind re: the
difference lenses make between scans of the same document in the same
position, etc. Also DPI questions. These are questions that affect the
value of the digital data being created and impact whether or not this
data becomes an asset that we do want to preserve.
It might not start as a digital preservation question but it will become
one. Best to start with preservation in mind. Some records are not kept
after digitization - that makes these new digital records quasi-born
digital.
The compromise would be to discuss just the results of the digitization
process.
Incidentally, in my mind it is clear the OCR question referenced in the
OP is NOT a digital preservation question. It belongs on the libraries
Stack Exchange. My rationale is simple - OCR can be done over and over
again. The OCR output is data associated with the digital object we're
interested in, almost a second layer of metadata. The error is an
artifact of the current OCR process that doesn't impact preservation.
A few naive OCR DP questions (not having handled OCR much in the past)
might be:
- What standard formats are there for storing OCR data?
- What long term metrics should I maintain about my OCR that will be
useful to future users of the data?
- How do I tie my OCR data to my digital images for long term access?
Those may not be perfect so happy to discuss in the comments below, I'm
eager to find the boundary here too.
I think the site has to be carefully moderated. It is a Q&A site so
ensuring the same digitization question isn't repeated over and over
will be important, and moderators, maybe via the discussion forum, should
have a clear FAQ about what constitutes an acceptable digitization
question before closing anything.
Hope that helps.
Comments
Answer by Bill Lefurgy
Let's think about it in narrow terms as creation of digital content from
a preservation perspective. Questions about optimal design of a
prospective website in connection with formats, metadata and
architecture presumably would be in scope.
Someone undertaking digitization could ask the same kinds of questions,
as long as they were focused on how best to create preservable output.
The same could be true of any digital creation activity--photographs,
videos, data sets and so on. Anything that doesn't relate to creating
content for clear-cut preservation purposes (equipment, throughput, QC)
would be out of scope.
Comments