Web Archiving Analysis Workshop

Ian Milligan

University of Waterloo

Follow along at https://ianmilligan1.github.io/webarchive-workshop/
In the workshop, let's make sure to run our downloads right now -- so they're ready when we need them!

* Download and install Gephi: https://gephi.org

* Download and install Docker: https://www.docker.com/

Follow these instructions to install Docker.

Part One: Accessing the Archived Web (Responsibly)

Let's go to the Internet Archive!

Wayback Machine

Let's do a keyword search for "Simon Fraser University."


Note the drift.

Exercise #1

Try to find some sites of interest. Do you see any gaps? Why were sites collected? Do you find any temporal violations?

Can you find any sites that may have never existed? (spooky)

Part Two: Conceiving a Research Question

Let's look at a few web research approaches first.

Voyant Tools.

Network Analysis.

Looking at pages!

Exercise #2

Let's now begin to think about what we could do with web archives. Try to think of a simple research question you could explore with five to ten websites from the Web, involving the following dimensions:

  • Using plain text
  • Using hyperlinks
  • Visual layout (using Wayback Machine)

Take a few minutes to write down your ideas.

Part Three: Getting Web Archive Files

Always check for existing collections!

Wayback Machine powered by WARC files?

Example Collection: Canadian Political Parties at University of Toronto!

Example Collection: University of Toronto Collections in General!

Example Collection: Simon Fraser University!

Archive-It Research Services

Archives Unleashed Project!

Talk to your librarian, or the librarian who controls a collection! They probably want you to be able to build on their hard work.

Exercise #3

Start exploring the Archive-It page!

  • What sorts of collections can you find in Canada?
  • What sorts of collections can you find elsewhere?
  • Can you imagine using any of these in your research?

Take 5-10 minutes to explore, and then we'll have a quick discussion.

Part Four: WASAPI for Fun and Profit

Guest Star: Nick Ruest!


Java WASAPI Downloader

Python WASAPI Downloader
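
Both downloaders talk to a WASAPI endpoint that lists the WARC files a partner is allowed to fetch. As a rough sketch (the collection number below is a placeholder, and you need Archive-It partner credentials), you can see what the endpoint returns with curl:

curl -u YOUR_USERNAME "https://partner.archive-it.org/wasapi/v1/webdata?collection=12345"

The JSON response describes each downloadable WARC file; the Java and Python downloaders simply automate fetching them.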

Part Five: Rolling Your Own Web Archive

Sometimes you have to make your own web archive!


Heritrix is a bit hard to run...

You can also use Wget

But it's not a walk in the park either...
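
If you do want to experiment with Wget, recent versions can write WARC files as they crawl. A minimal sketch (the URL, wait time, and file name here are just examples):

wget --mirror --page-requisites --warc-file=my-crawl --wait=1 "http://example.com/"

This saves a local mirror of the site and also writes my-crawl.warc.gz alongside it.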
You can use WebRecorder.io!
It is a lot of fun, and pretty easy to use.


Exercise #4

Use WebRecorder.io to grab the following two things:

Take about 5 or 10 minutes to get familiar with the interface. When you're done, we'll begin thinking about our own collection...

Exercise #5

Remember those five to ten websites I asked you to write down in Exercise #2?

You guessed it! Now begin to crawl this content with WebRecorder.io. Please try to make sure you don't grab more than 200-300 MB (just to spare the WiFi).

When you are done, download the WARCs.

Part Six: Unleashing Archives with the Archives Unleashed Toolkit

You all installed Docker and Gephi, right?

If not... quietly do so!

Make sure Docker is running.
This is all really Exercise #6, as we will be hands-on the whole time!

Create a directory, by default on your desktop.

If you downloaded your WARCs, please put them in this new directory.

Now open up your terminal window.

We now need to run the following command. You need to replace /path/to/your/data with the directory you just made.

Don't worry! We'll be here to help.

							docker run --rm -it -v "/path/to/your/data:/data" archivesunleashed/docker-aut

Remember, replace /path/to/your/data with your own path. On my system, it is:

							docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut

Once it's working, you should see:

							Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.



Hello World: Our First Script!

Now that we are at the prompt, let's get used to running commands. The easiest way to use the Spark shell is to paste in scripts that you've written somewhere else.

At the scala> prompt, type the following command and press enter:

:paste

This puts the shell into paste mode. Now, cut and paste the following script:

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)

Let's take a moment to look at this script! It:

  • begins by importing the AUT libraries;
  • tells the program where it can find the data (in this case, the sample data that we have included in this Docker image);
  • tells it only to keep the "valid" pages, in this case HTML data;
  • tells it to ExtractDomain, or find the base domain of each URL - e.g. for www.google.com/cats, we are interested in just the domain, www.google.com;
  • count them - how many times does www.google.com appear in this collection, for example;
  • and display the top ten!

Now that it is pasted in, let's run it!

Hold CTRL and D at the same time.

You should see:

					// Exiting paste mode, now interpreting.

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._
r: Array[(String, Int)] = Array((www.equalvoice.ca,4644), (www.liberal.ca,1968), (greenparty.ca,732), (www.policyalternatives.ca,601), (www.fairvote.ca,465), (www.ndp.ca,417), (www.davidsuzuki.org,396), (www.canadiancrc.com,90), (www.gca.ca,40), (communist-party.ca,39))


We do this example for two reasons:

  • It is fairly simple and lets us know that AUT is working;
  • and it tells us what we can expect to find in the web archives! In this case, we have a lot of the Liberal Party of Canada, Equal Voice Canada, and the Green Party of Canada.

Now let's try with your own data! To do so, we take this script and substitute in a new directory. Remember to type in :paste and Ctrl+D to run it.

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/data/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)

Extracting Text

Now that we know what we might find in a web archive, let us try extracting some text. You might want to get just the text of a given website or domain, for example.

Above we learned that the Liberal Party of Canada's website has 1,968 captures in the sample files we provided. Let's try to just extract that text.

To load this script, remember: type :paste, copy and paste it into the shell, and then hold Ctrl and D at the same time.

import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.liberal.ca"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/liberal-party-text")

If you're using your own data, this is why the domain count was key! Swap out "www.liberal.ca" in the .keepDomains line above for a domain that appears in your own data.
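
If you want a starting point, here is a sketch with placeholders - the domain and output directory below are just examples, so swap in a domain from your own domain count and any output name you like:

import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

// Load your own WARCs from /data, keep pages from one domain, and save their text.
RecordLoader.loadArchives("/data/*.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.example.com")) // placeholder: use a domain from your own data
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/my-domain-text") // placeholder output directory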

Now let's look at the ensuing data. Go to the folder you provided in the very first startup – remember, in my case it was /users/ianmilligan1/desktop/data - and you will now have a folder called liberal-party-text. Open up the files with your text editor and check it out!

Once you open up your `liberal-party-text` folder, you should see three files:

  • _SUCCESS
  • part-00000
  • part-00001

The `_SUCCESS` file is just there to tell you that it worked!

The two part files contain the results. Open them up with a text editor!

Just the Text!

You might just want to generate plain text. You could try the following.

import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.liberal.ca"))
  .map(r => RemoveHTML(r.getContentString))
  .saveAsTextFile("/data/liberal-party-plain-text")

This way you could just paste the results somewhere for analysis, without the extra information like domains, dates, and URLs.

For example, you could paste this data into Voyant Tools!

Ouch! Our First Error!

One of the vexing parts of this interface is that it creates output directories – and if the directory already exists, it comes tumbling down.

As this is one of the most common errors, let's see it and then learn how to get around it.

Try running the exact same script that you did above.

import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.liberal.ca"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/liberal-party-text")

Instead of a nice crisp feeling of success, you will see a long dump of text beginning with:

org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/data/liberal-party-text already exists

To get around this, you can do two things:

  • Delete the existing directory that you created;
  • Change the name of the output directory - to /data/liberal-party-text-2, for example (see the sketch below).
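
For the second option, only the last line of the script changes - a minimal sketch, keeping everything else the same:

  .saveAsTextFile("/data/liberal-party-text-2")

Re-run the script with :paste and Ctrl+D as before, and the results land in the new directory.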

Good luck!

Other Text Analysis Filters

Take some time to explore the various options and variables that you can swap in and around the .keepDomains line. Check out the documentation for some ideas.

Keep URL Patterns

Instead of domains, what if you wanted to have text relating to just a certain pattern? Substitute the .keepDomains(Set("www.liberal.ca")) line

for a .keepUrlPatterns command, sketched below:


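As a rough sketch - the regular expression and output directory here are just illustrations, not part of the workshop data - a full script filtering by URL pattern might look like:

import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

// Keep only pages whose URLs match a regular expression, then save their text.
RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("(?i).*liberal.*".r)) // example pattern: any URL containing "liberal"
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/url-pattern-text") // example output directory
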
Filter by Date

What if we just wanted data from 2005, or 2008? Look for the .keepDomains(Set("www.liberal.ca")) line,

and then add the following line below it:

.keepDate(List("2005"), YYYY)

Filter by Language

What if you just want French-language pages? After the .keepDomains(Set("www.liberal.ca")) line,

add a new line:

.keepLanguages(Set("fr"))
For example, if we just wanted French-language Liberal pages, we would run:

import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.liberal.ca"))
  .keepLanguages(Set("fr"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/data/liberal-party-french-text")

People, Places, and Things: Entities Ahoy!

One last thing we can do with text is to use named-entity recognition (NER) to try to find people, organizations, and locations within the text.

To do this, we need to have a classifier - luckily, we have included an English-language one from the Stanford NER project in this Docker image!

import io.archivesunleashed.spark.matchbox.ExtractEntities

ExtractEntities.extractFromRecords("/aut-resources/NER/english.all.3class.distsim.crf.ser.gz", "/aut-resources/Sample-Data/*.gz", "/data/ner-output/", sc)

This will take a fair amount of time. Good excuse for a coffee break and to stretch!

When it is done, you will have results in the /data/ner-output/ directory. The first line should look like:

					(20060622,http://www.gca.ca/indexcms/?organizations&orgid=27,{"PERSON":["Marie"],"ORGANIZATION":["Green Communities Canada","Green Communities Canada News and Events Our Programs Join Green Communities Canada Downloads Privacy Policy Site Map GCA Clean North Kathie Brosemer"],"LOCATION":["St. E. Sault","Canada"]})

Part Seven: Network Analysis

One other thing we can do is a network analysis. By now you are probably getting good at running code.

Let's extract all of the links from the sample data and export them to a file format that the popular network analysis program Gephi can use.

import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGEXF}
import io.archivesunleashed.spark.rdd.RecordRDD._

val links = RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGEXF(links, "/data/links-for-gephi.gexf")

Now let's use Gephi! We'll do this together.

Let's open up Gephi.

Select "open a Graph File" and select the file that you just generated.

Click "OK" and then "Create a New Graph."

If you see the Borg, you're looking good!

Now go to "Data Laboratory," "Nodes," and "Copy Data to Another Column." Select ID and then copy it to "label."

Go back to "overview."

We'll then lay out the graph using "Yifan Hu," and then set up rankings.

Do you want different data to work with?


The rest we can do hands-on - it is getting more difficult to explain!

Or you can follow along here.

Thanks for your time today!

This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Smart Start Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.