TIMETool - Home

Overview

This project proposes an open-source library that implements state-of-the-art time-aware information retrieval (IR) approaches in MapReduce. The library will serve as a toolkit for effective and efficient search over temporal Web collections, in which documents are created and/or edited over time. Examples include web archives, news archives, blogs, personal emails, and enterprise documents.

The expected outcome is a pragmatic contribution that supports prospective researchers and practitioners who want to study temporal search or extend it toward temporal exploration and analytics.

Software Library:

Data Collections

We aim to work with two types of temporal document collections:

Non-versioned: TREC Web collections (e.g., ClueWeb2009 and TREC Blog) and news archives (e.g., New York Times Annotated Corpus).
Versioned: UK and DE domain web archive collections.

MapReduce Basic

MapReduce is a programming framework [9] for processing huge amounts of unstructured data in a massively parallel way.

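Two main functions: a map function, which turns each input record into intermediate key-value pairs, and a reduce function, which merges all values that share the same key.
Both functions draw on concepts from functional programming.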
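
The two functions can be illustrated with a minimal single-process sketch in Python. It is not part of the TIMETool library and does not use Hadoop; the sample records, the function names (map_fn, reduce_fn, run_mapreduce), and the choice of counting documents per publication year are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (document id, publication year, text).
DOCS = [
    ("d1", 2008, "olympic games in beijing"),
    ("d2", 2008, "financial crisis"),
    ("d3", 2012, "olympic games in london"),
]


def map_fn(doc):
    """Map: emit one (year, 1) pair per document."""
    _doc_id, year, _text = doc
    yield year, 1


def reduce_fn(key, values):
    """Reduce: sum all counts that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more key-value pairs.
        for key, value in mapper(record):
            # Shuffle phase: group the emitted values by key.
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


docs_per_year = run_mapreduce(DOCS, map_fn, reduce_fn)
print(docs_per_year)  # {2008: 2, 2012: 1}
```

In a real MapReduce framework the map and reduce calls run in parallel on many machines and the grouping step is handled by the framework; only the two user-defined functions carry over from this sketch.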

Hadoop-based Tools

Several existing Hadoop-based frameworks provide solutions for large-scale data analysis. We present four of them here: HBase (for storage), SolrCloud + Hadoop (for indexing), Mahout (for machine learning), and R + Hadoop (for statistical analysis).

TIMETool Project: Large-Scale Temporal Search in MapReduce
