
Web archiving at Sound & Vision: outcomes of our NTR pilot

 

In 2013, Sound & Vision’s R&D department teamed up with Dutch public broadcaster NTR in a pilot project to archive four of its websites. Sound & Vision is one of the largest audiovisual archives in Europe, with a collection of 800,000 hours. In recent years, broadcasts have increasingly been accompanied by in-depth websites that provide additional content. Websites and broadcasts thus complement each other, and Sound & Vision is interested in archiving these websites to contextualize its audiovisual collection. We started to explore the topic in 2008 in a series of research projects, most notably the EU project Living Web Archives (LiWA). When NTR, which makes programmes with an educational and cultural perspective, contacted Sound & Vision to safeguard four of its websites, the R&D department started a web archiving pilot project to put the knowledge gained in those research projects into practice. The results are now publicly available in our online web archive, and in this blog post we’ll describe how we realised it.

The websites

The following four websites were archived within the scope of the pilot:

  • NovaTV: NOVA/Den Haag Vandaag was a current affairs programme that aired from 1992 to 2010. The website consists of specific files and reports on topics discussed in the programme.
  • PREMtime: PREMtime was a radio talk show (2010-2012) with public and expert opinions on topics mainly related to politics and cultural diversity.
  • SchoolTV Plein: This NTR portal was aimed at primary school children, and was selected to be taken offline, since it would be replaced. Because it is an overall archive of all the material used in NTR’s general educational platform SchoolTV, archiving it supports editorial teams that want to reuse the content.
  • Verre Verwanten: In Verre Verwanten, Dutch media personalities trace back their genealogical roots. The programme ran from 2005 to 2008, and its accompanying website contains broadcasts and the genealogical and historical research information gathered by the programme’s editors.

Screenshots of the archived NTR websites.

Archiving the websites

Our crawling partner Internet Memory Research (IMR) archived the websites, after which Sound & Vision and NTR did the quality assurance. If anomalies or missing content were detected, these were reported to IMR, whose engineers tried to recapture the content where possible. For example, images were sometimes not hosted on the domain of the archived website, but on another NTR domain. IMR then had to archive those images from the other domain and add them to the web archive. Some issues, however, could not (yet) be resolved, or were very difficult to patch. One major problem is content that is dynamically generated through POST requests or JavaScript code (read more about this on the UK Web Archive blog). Another common issue in web archiving projects is archiving audiovisual content. Very often, this AV content is hosted on protected servers and served through various complex streaming protocols, so web crawlers cannot simply access and harvest it. IMR used a specific application for downloading video content, but capturing it still requires a thorough understanding of each website, its video protocols and the storage location of the videos.
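The off-domain image problem described above lends itself to a simple automated check. The following is only a minimal sketch in Python, not the tooling IMR or Sound & Vision actually used: the function name `off_site_images`, the sample HTML and the `example.org` addresses are all hypothetical. It lists the images on a page that live on a different host than the page itself, which are candidates for a separate capture.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def off_site_images(page_url, html):
    """Return absolute image URLs whose host differs from the page's host."""
    collector = ImageCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    # Resolve relative paths against the page URL before comparing hosts.
    absolute = (urljoin(page_url, src) for src in collector.sources)
    return [url for url in absolute if urlparse(url).netloc != page_host]


# Hypothetical page: one image on the site itself, one on another domain.
sample_html = """
<html><body>
  <img src="/img/logo.png">
  <img src="http://media.example.org/photos/report.jpg">
</body></html>
"""

missing = off_site_images("http://programme.example.org/index.html", sample_html)
print(missing)
```

A check like this only covers images referenced in static HTML. Content loaded through JavaScript or POST requests, as mentioned above, would not show up in it.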

Providing access

We also wanted to provide fast and attractive access to the web archive, which would not only present the results of the NTR pilot but also form the basis for our future web archiving plans. We therefore held a user requirements session with prospective end users and stakeholders from Sound & Vision and NTR, together with the developers from Frontwise and Dispectu. The main outcomes were that users expect full-text search, want to filter search results on, for example, website and time period, want to compare different crawls of the same website, and in general want to see quite a lot of information about the crawl itself (such as the archiving date and the original URL) and about the contents of the entire web archive. With these outcomes in mind, development started: Frontwise developed the front-end, and Dispectu built the back-end and made a full-text index of the web archive. In total, three main pages were developed: the home page, the search page and the browse page, on which users can view the archived website.

Search result page of the Sound & Vision web archive.
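To make the user requirements concrete (full-text search, filtering on website and time period, and comparing crawls of the same page), here is a small illustrative sketch in Python. It is not the Frontwise or Dispectu implementation. The `Capture` record, the `search` and `crawls_of` functions, and all sample data (site names, URLs, dates, text) are invented for illustration, and a real back-end would use a proper full-text index rather than a linear scan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Capture:
    """One archived page: which site, which URL, when crawled, and its text."""
    website: str
    original_url: str
    crawl_date: date
    text: str


def search(captures, query, website=None, start=None, end=None):
    """Naive full-text search with optional website and date-range filters."""
    needle = query.lower()
    hits = []
    for capture in captures:
        if needle not in capture.text.lower():
            continue
        if website is not None and capture.website != website:
            continue
        if start is not None and capture.crawl_date < start:
            continue
        if end is not None and capture.crawl_date > end:
            continue
        hits.append(capture)
    return sorted(hits, key=lambda c: c.crawl_date)


def crawls_of(captures, original_url):
    """All captures of one URL, oldest first, so crawls can be compared."""
    same = [c for c in captures if c.original_url == original_url]
    return sorted(same, key=lambda c: c.crawl_date)


# Invented sample data: two crawls of one page and one page from another site.
archive = [
    Capture("site-a", "http://site-a.example.org/", date(2013, 9, 1), "Election report"),
    Capture("site-a", "http://site-a.example.org/", date(2013, 3, 1), "Election preview"),
    Capture("site-b", "http://site-b.example.org/", date(2013, 5, 1), "Election debate"),
]

hits = search(archive, "election", website="site-a", start=date(2013, 6, 1))
versions = crawls_of(archive, "http://site-a.example.org/")
print([h.text for h in hits])
print([v.crawl_date.isoformat() for v in versions])
```

Keeping the archiving date and the original URL on every record is what makes both the filters and the crawl comparison possible, which matches the information users said they wanted to see.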

Lessons learned and future work

The pilot with NTR gave us many insights into archiving websites of public broadcasters [1]. First of all, these websites contain a lot of AV content, which is difficult to archive. Secondly, as is the case with websites in general, they use more and more JavaScript and other highly dynamic techniques. Current web archiving techniques are lagging behind these developments, and we need to catch up in order to safeguard modern websites. While developing the front-end and back-end, we learned that creating a clean interface that pleases everyone, and indexing even this fairly small sample of websites, is no easy feat. Furthermore, a web archive is still a relative novelty for many internet users, and it is hard for those unfamiliar with web archives to understand their context and limitations. More generally, the NTR pilot and the other R&D web archiving projects have resulted in insights that have shaped Sound & Vision’s collection policy and vision on the topic. As a result, this year the first steps will be taken to structurally create a web archive related to Sound & Vision’s collection, consisting of broadcaster and non-broadcaster material. For now, you can find the NovaTV, PREMtime and Verre Verwanten websites in our web archive. Please let us know if you have any questions or comments: webarchief [at] beeldengeluid.nl. We’d love to hear from you!

Further reading

[1] The definitive, peer-reviewed and edited version of this article is published in: Baltussen, Lotte Belice; Blom, Jaap; Medjkoune, Leïla; Pop, Radu; Van Gorp, Jasmijn; Huurdeman, Hugo; Haaijer, Leidi 2014. Hard Content, Fab Front-End: Archiving Websites of Dutch Public Broadcasters, Alexandria Journal, Volume 25, Numbers 1-2, August 2014, pp. 69-91, DOI: 10.7227/ALX.0021. © 2014. Editor: Monica Blake. Published by Manchester University Press.