The KrdWrd Project

Archived Web site



The KrdWrd Project ran from 2008 to 2011. Its mission statement was:

Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.

Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.

In essence, it was an infrastructure for research into web page cleaning. A good overview can be found in the paper, and an extensive description in the master's thesis (both listed below).


  1. The annotation guidelines and the Firefox add-on manual are still available online and as PDF files.

  2. The CANOLA Corpus

  3. A snapshot of the original Wiki, which holds more information on many topics (originally accessible at …)

System Components

The system consisted of

Cite Work

Paper (WAC5)

@inproceedings{steger-stemle-2009-krdwrd,
  abstract = {Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extraction, providing unified data access in every processing step. Stable data storage and collaboration control scripts for group annotations of massive corpora are provided via a Web interface coupled with a HTTP proxy. A modular interface allows for linguistic and visual data feature extractor plugins. The implementation is suitable for many tasks in the Web as corpus domain and beyond.},
  author = {Steger, Johannes and Stemle, Egon},
  booktitle = {Proceedings of the {Fifth Web} as {Corpus Workshop} ({WAC5})},
  date = {2009-09},
  editor = {Alegria, Iñaki and Leturia, Igor and Sharoff, Serge},
  location = {{Donostia-San Sebastian, Basque Country, Spain}},
  pages = {63-70},
  publisher = {{Elhuyar Fundazioa}},
  title = {{KrdWrd}: {Architecture} for {Unified Processing} of {Web Content}},
  url = {}
}

Annotation Guidelines and Firefox Add-On Manual

@manual{krdwrd-2010-addon,
  abstract = {"The availability of large text corpora has changed the scientific approach to language in linguistics and cognitive science" [M\&S]. Today, the by far richest source for authentic natural language data is the World Wide Web, and making it useful as a data source for scientific research is imperative. Web pages, however, can not be used for computational linguistic processing without filtering: They contain code for processing by the Web browser, there are menus, headers, footers, form fields, teasers, out-links, spam-text – all of which needs to be stripped.
The dimension of this task calls for an automated solution, the broadness of the problem for machine learning based approaches. Part of the KrdWrd project deals with the development of appropriate methods, but they require hand-annotated pages for training.
The KrdWrd Add-on aims at making this kind of tagging of Web pages possible. For users, we provide accurate Web page presentation and annotation utilities in a typical browsing environment, while preserving the original document and all the additional information contained therein.},
  author = {{The KrdWrd Team}},
  date = {2010},
  publisher = {{The KrdWrd Project}},
  title = {Add-on manual},
  url = {}
}

Master’s Thesis

@thesis{stemle-2009-hybrid,
  abstract = {This thesis discusses the KrdWrd Project. The Project goals are to provide tools and infrastructure for acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operate on the visual rendering of pages, and provide visual tools for inspection of results.},
  author = {Stemle, Egon W.},
  date = {2009-04},
  institution = {{University of Osnabrück}},
  pubstate = {Unpublished},
  title = {Hybrid {{Sweeping}}: {{Streamlined Perceptual Structured}}-{{Text Refinement}}},
  type = {Master's thesis},
  url = {}
}