Understanding Web Archiving

<h3>Welcome to the WARC-9000</h3> (if: (history:) contains "Actions")[(either: "Good work.","Keep it up.","Good to see you again.","Oh, it's you.","Back so soon?","I want to help you.","Everything is going extremely well.","I am completely operational, and all my circuits are functioning perfectly.") What do you want to do next? ](else:)[ *Instead of your usual home page, the browser brings up some kind of web archiving user interface. The system presents you with three options.* How can I help you? ] * [[Crawl]] * [[Replay]] * [[Preserve|Preserve2]]

<h3>Crawl</h3> The screen fills with text, describing [[what it means to crawl a web page, and the tools you can use to do it|Understanding Crawling]]. (if: (history:) contains "Crawl")[You notice that the WARC-9000 is now displaying information about the crawling process at the bottom of the browser window. It says: <span class="autobui">**So far, you have crawled (print: $ctot) resource(s).** (link:"Start Crawling")[(set: $mode to "crawl")(goto:"Homepage")]</span> [[Do something else|Actions]] ]

<h3>Generate CDX</h3> (if: $ctot is 0)[You have not crawled anything yet!](else:)[ Indexing (print: $ctot) resources...{(live:2s)[ DONE. <br/> <br/> [[Back|Replay]] (stop:)]} (set: $itot to $ctot) ]

<h3>Replay</h3> (if: $ctot is 0)[You've not crawled anything yet! [[Now what?|Actions]]](else:)[{ (if: $itot is not 0)[ <span class="autobui">You have crawled (print: $ctot) resource(s), and (print: $itot) of those have been indexed for replay. (link:"Replay Crawled Site")[(set: $mode to "replay")(goto:"Homepage")]</span> ](else:)[ When replaying an archived web site, we need to be able to take the web requests coming from the browser and work out which of the resources we've collected each request corresponds to. This means we need a map that takes an individual resource URL and tells us where to look in the WARC files. This is a type of indexing, where we process each WARC file and generate a content index or <a target="_blank" href="http://archive.org/web/researcher/cdx_file_format.php">CDX file</a>. <span class="autobui">[[Generate CDX...|Index]]</span> ] [[Do something else|Actions]] } ]
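
To make the idea of a URL-to-WARC map concrete, here is a minimal Python sketch. It does not read real WARC files or write the real CDX format: the `IndexEntry` fields, the `build_index` and `lookup` helpers, and the shape of the input records are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    """One captured resource and where it lives (hypothetical layout)."""
    url: str
    timestamp: str  # capture time as a sortable string, e.g. "20240101120000"
    warc_file: str  # name of the WARC file holding the record
    offset: int     # byte offset of the record within that file
    length: int     # length of the record in bytes


def build_index(records):
    """Group capture records by URL, oldest capture first.

    `records` is an iterable of (url, timestamp, warc_file, offset, length)
    tuples, assumed to have been extracted from the WARC files already.
    """
    index = {}
    for record in records:
        entry = IndexEntry(*record)
        index.setdefault(entry.url, []).append(entry)
    for captures in index.values():
        captures.sort(key=lambda e: e.timestamp)
    return index


def lookup(index, url) -> Optional[IndexEntry]:
    """Return the most recent capture of `url`, or None if it was never crawled."""
    captures = index.get(url)
    return captures[-1] if captures else None
```

A replay tool would call something like `lookup` for each incoming browser request, then read `length` bytes at `offset` from the named WARC file. A `None` result corresponds to the "Resource Not In Archive" case.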

(if: $mode is "crawl" or $visited_home_page is true)[ <span class="site">**Welcome to SCCorp!** From here you can: * [[Look at something exciting|Static Page]] * [[Look at something dynamic|Dynamic Page]] * [[Sign our guestbook|Interactive Page]] </span> ](else:)[ (display: "Resource Not In Archive") ] { <span class="autobui"> (if: $mode is "replay")[[[Stop Replaying|Replay]]](else:)[[[Stop Crawling|Crawl]] (set: $ctot to $ctot + 1) <br/> <small>You have crawled (print: $ctot) resource(s) so far.</small> (set: $visited_home_page to true) ] </span> }

<h2>Understanding Web Archiving:<br/><small>An interactive introduction to how web archiving works.</small></h2>We'll find out how web content is archived, preserved, and made accessible. Along the way, we'll introduce the major concepts and de-mystify the technical jargon. Interested? Then let's [[Get Started...|The Scene]] (set: $ctot to 0) (set: $itot to 0)

(if: $mode is "crawl" or $visited_static_page is true)[ <span class="site">[[Homepage]] > Something Exciting This is a great page. </span> ](else:)[ (display: "Resource Not In Archive") ] { <span class="autobui"> (if: $mode is "replay")[[[Stop Replaying|Replay]]](else:)[[[Stop Crawling|Crawl]] (set: $ctot to $ctot + 1) <br/> <small>You have crawled (print: $ctot) resource(s) so far.</small> (set: $visited_static_page to true) ] </span> }

(if: $mode is "crawl" or $visited_dynamic_page is true)[ <span class="site">[[Homepage]] > Something Dynamic This is a great page. </span> ](else:)[ (display: "Resource Not In Archive") ] { <span class="autobui"> (if: $mode is "replay")[[[Stop Replaying|Replay]]](else:)[[[Stop Crawling|Crawl]] (set: $ctot to $ctot + 1) <br/> <small>You have crawled (print: $ctot) resource(s) so far.</small> (set: $visited_dynamic_page to true) ] </span> }

(if: $mode is "crawl" or $visited_interactive_page is true)[ <span class="site">[[Homepage]] > Guestbook This is a great page. </span> ](else:)[ (display: "Resource Not In Archive") ] { <span class="autobui"> (if: $mode is "replay")[[[Stop Replaying|Replay]]](else:)[[[Stop Crawling|Crawl]] (set: $ctot to $ctot + 1) <br/> <small>You have crawled (print: $ctot) resource(s) so far.</small> (set: $visited_interactive_page to true) ] </span> }

<h3>Resource Not In Archive!</h3> Oh dear, it looks like you didn't manage to capture this page during the crawl.

<h3>Preserve</h3> (if: $ctot is 0)[You have not crawled anything yet! At least it's cheap to store. [[Now what?|Actions]]](else:)[You have: (if: $itot is 0)[ * (print: $ctot) crawled resource(s) stored in WARC format. ](else:)[ * (print: $ctot) crawled resource(s) stored in WARC format. * One CDX file built by indexing (print: $itot) resource(s). ] But how will you preserve what you've got? * [[Know Your Data]] * [[Perform Fixity Checks]] * [[Scan For Access Risks]] [[Do something else|Actions]] ]

So what's in a WARC? For more: * <a target="_blank" href="http://fileformats.archiveteam.org/wiki/WARC">WARC on the file format wiki</a> * <a target="_blank" href="http://en.wikipedia.org/wiki/Web_ARChive">WARC on Wikipedia</a>

Are our WARCs still intact? Undamaged by bad luck, bit rot, or poor handling? How would we know? <a target="_blank" href="http://en.wikipedia.org/wiki/File_Fixity">File Fixity on Wikipedia</a>
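
One common way to find out is to record a checksum for each file when it is stored, then recompute it later and compare. The Python sketch below assumes SHA-256 and a simple in-memory manifest mapping file paths to expected digests. The `sha256_of` and `check_fixity` names and the manifest shape are hypothetical, not part of any particular preservation tool.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Return the paths whose current digest differs from the recorded one.

    `manifest` maps each file path to the hex digest recorded at ingest.
    Files that are missing or unreadable are reported as failures too.
    """
    failed = []
    for path, expected in manifest.items():
        try:
            intact = sha256_of(path) == expected
        except OSError:
            intact = False
        if not intact:
            failed.append(path)
    return failed
```

An empty list from `check_fixity` means every file still matches its recorded digest. Any path in the result needs investigating and, ideally, restoring from another copy.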


However, your organisation has decided against using an automated crawler. Instead, as you browse the site, your computer will record the conversation between the browser and the web server and build up the archive as you go. This approach gives you a high-quality archive, but it also means you'll have to go through the pages of the web site by hand. [[Crawl]]

You are in a maze of open-plan office cubicles, all alike. Your boss has told you that you are now responsible for archiving your organisation's web site. You oozed confidence as you accepted the role. After all, how hard can it be? And anyway, sometimes the best way to learn is through on-the-job experience. You sit at your desk, turn to the computer, and hunt around the desktop and menus as you try to work out where to start. Eventually, you give up, and [[open up the web browser|Actions]] in order to do some research, after you've checked Facebook, obviously...