Ontologies and their use in Information Retrieval (Tutor: Mauro Dragoni)
This lecture provides an overview of what an ontology is and how it can be used for representing information and for retrieving data, with a particular focus on the linguistic resources available for supporting this kind of task. It then surveys semantic-based retrieval approaches, highlighting the pros and cons of semantic approaches with respect to classic ones. References will be made to multilinguality in semantic information retrieval via case studies.
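One common way an ontology supports retrieval is query expansion: a query term is enriched with its narrower concepts before matching. The following is a minimal sketch, using a hypothetical toy is-a hierarchy (the concepts and the dictionary structure are illustrative assumptions, not a real linguistic resource):

```python
# Hypothetical mini-ontology: concept -> list of narrower concepts.
ONTOLOGY = {
    "vehicle": ["car", "bicycle"],
    "car": ["suv", "sedan"],
}

def expand(term):
    """Return the term plus all of its narrower concepts, depth-first."""
    result = [term]
    for child in ONTOLOGY.get(term, []):
        result.extend(expand(child))
    return result

print(expand("vehicle"))  # ['vehicle', 'car', 'suv', 'sedan', 'bicycle']
```

A query for "vehicle" thus also retrieves documents mentioning only "sedan", which a purely lexical match would miss.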
Large scale and high dimensional issues in data indexing, processing and mining (Tutor: Stephane Marchand-Maillet)
Data analysis, learning and retrieval processes are access-intensive operations. The large volume, fast velocity and important variability of data impose adapted and adaptive storage and access mechanisms, which must also cope with the issues raised by high-dimensional sparse spaces. Reference-based indexing using pivots or permutations has recently proved quite effective for such operations.
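The core idea of permutation-based indexing can be sketched as follows: each object is represented by the order in which it "sees" a small set of reference pivots (nearest first), and candidates are ranked by comparing these permutations (here with the Spearman footrule) instead of computing exact distances. The pivots and data below are hypothetical one-dimensional points chosen for illustration:

```python
# Hypothetical pivots on the real line.
PIVOTS = [0.0, 5.0, 10.0]

def permutation(x):
    """Indices of the pivots sorted by increasing distance from x."""
    return sorted(range(len(PIVOTS)), key=lambda i: abs(x - PIVOTS[i]))

def footrule(p, q):
    """Spearman footrule: total rank displacement between two permutations."""
    rank_in_q = {pivot: rank for rank, pivot in enumerate(q)}
    return sum(abs(rank - rank_in_q[pivot]) for rank, pivot in enumerate(p))

database = [1.0, 4.0, 9.0]
query = 8.5
q_perm = permutation(query)

# Rank database objects by similarity of their permutations to the query's.
ranked = sorted(database, key=lambda x: footrule(permutation(x), q_perm))
print(ranked[0])  # 9.0 -- the object closest to the query
```

Objects with a similar view of the pivots tend to be close in the original space, so the cheap permutation comparison serves as a filter before any exact distance computation.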
Provenance and quality (Tutor: Paolo Missier)
Big Data is typically characterised by large volume, fast velocity and important variability (the 3 Vs). In contrast, its quality may be low and its provenance uncontrolled. Improving quality and controlling provenance may be key to reducing data volume while preserving information.
The connection between provenance of data and quality of data has not been explored much in the recent past. Yet, it is a fairly intuitive connection to make: since provenance is essentially evidence of a data transformation process that has taken place, can data provenance be exploited, either manually or using analysis algorithms, to facilitate the assessment of the quality of the data at the end of the process? And what about quality improvement?
This session will begin with an introduction to provenance, illustrating key definitions, formalisation of data provenance, and the essential elements of the W3C PROV recommendation for provenance modelling. It will then review recent contributions that specifically attempt to make the connection between data provenance and quality.
Finally, it will turn to the specific context of keyword search, asking more speculative questions where research is still needed. For instance, suppose that a keyword-based query engine is equipped with provenance tracing capabilities, i.e. it traces all elements of the query answering process. How can such traces be used to better understand the quality (reliability, credibility, precision) of the query results?
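One speculative way to connect traces to quality can be sketched in a few lines: if each derived entity records which entities it was derived from (in the spirit of PROV's wasDerivedFrom relation), a query answer can be traced back to its ultimate sources, whose credibility scores are then aggregated. The entities, scores and the pessimistic min-aggregation below are all illustrative assumptions:

```python
# Hypothetical derivation graph: entity -> entities it was derived from.
derived_from = {
    "answer1": ["doc_a", "doc_b"],
    "doc_b": ["feed_x"],
}
# Hypothetical credibility scores of the ultimate source entities.
credibility = {"doc_a": 0.9, "feed_x": 0.4}

def sources(entity):
    """Trace an entity back to its ultimate sources in the derivation graph."""
    if entity not in derived_from:
        return {entity}
    out = set()
    for parent in derived_from[entity]:
        out |= sources(parent)
    return out

def result_credibility(entity):
    """Pessimistic aggregate: a result is only as credible as its weakest source."""
    return min(credibility[s] for s in sources(entity))

print(sorted(sources("answer1")))      # ['doc_a', 'feed_x']
print(result_credibility("answer1"))   # 0.4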
Text representation for Data mining (Tutor: Julian Szymanski)
The lecture will describe methods for representing textual content so that it can be processed with data mining techniques. Basic approaches to text categorisation using supervised and unsupervised methods will be presented, and typical methods for dimensionality reduction will be introduced. During the lecture, the proposed approaches will be demonstrated on Wikipedia articles.
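The most basic such representation is the bag-of-words vector with TF-IDF weighting, sketched below on a hypothetical three-document corpus (no stemming or stop-word removal, for brevity):

```python
import math
from collections import Counter

# Toy corpus: each document becomes a vector over the shared vocabulary.
docs = ["data mining mining", "text mining", "text representation"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def tfidf(doc):
    """TF-IDF weight of every vocabulary term in one tokenized document."""
    tf = Counter(doc)
    n = len(tokenized)
    return [
        tf[t] / len(doc) * math.log(n / sum(t in d for d in tokenized))
        for t in vocab
    ]

vectors = [tfidf(d) for d in tokenized]
```

Note how the IDF factor downweights terms shared across documents: in the first document, "data" (rare in the corpus) gets a higher weight than "mining" despite "mining" occurring twice. Dimensionality reduction methods then compress these sparse, high-dimensional vectors before categorisation.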
Search, Exploration and Analytics of Evolving Data (Tutor: Nattiya Kanhabua)
Information retrieval models and algorithms have been extremely successful during the last 20 years in providing everybody with easy access to the vast amount of information available on and through the Web. Over time, the Web's growing volume of digital content, and our growing reliance on it, have demonstrated both its explosive evolution and its unprecedented temporal dynamics.
Temporal web dynamics, and how they impact upon various components of information retrieval (IR) systems, have received a large share of attention in the last decade. In particular, the study of relevance in information retrieval can now be framed within so-called time-aware IR approaches, which explain how user behavior, document content and scale vary with time, and how we can exploit this variation to improve retrieval effectiveness.
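A simple instance of such a time-aware approach is to blend textual relevance with an exponential recency decay, so that fresher documents rise for time-sensitive queries. The blending weight and half-life below are purely illustrative assumptions:

```python
import math

HALF_LIFE_DAYS = 30.0  # freshness halves every 30 days (assumption)
ALPHA = 0.7            # weight of textual relevance vs. recency (assumption)

def time_aware_score(text_score, age_days):
    """Blend a textual relevance score with an exponential recency decay."""
    recency = math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
    return ALPHA * text_score + (1 - ALPHA) * recency

# A slightly less relevant but much fresher document outranks an older one.
old = time_aware_score(0.80, age_days=120)
new = time_aware_score(0.75, age_days=1)
print(new > old)  # True
```

Real time-aware models go further, e.g. inferring the query's temporal intent before deciding how strongly to apply such a decay.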
Compression techniques and linked data (Tutors: Miguel Ángel Martínez and Antonio Fariña)
The Linked Open Data initiative has motivated many data providers to release their data in the form of RDF datasets. This has led to a significant increase in the volume of RDF data available for different uses and purposes. Although this is good news for the community, it also brings a potential scalability issue for applications managing such collections of big semantic data. In this scenario, RDF compression arises as an essential tool not only for storing and exchanging data, but also for indexing purposes. In the first part of this tutorial, we will introduce the HDT (Header-Dictionary-Triples) serialization format as an ideal solution for addressing scalability issues arising in real Linked Data workflows, which are also described.
Although some approaches achieve compression by discarding triples which can be inferred from others (semantic redundancy), the most prominent compressors exploit the underlying syntactic redundancy within the RDF graph structure. In the second part of the tutorial, we will analyze how RDF can be effectively compressed by detecting and removing two different sources of syntactic redundancy. On the one hand, RDF serializations (of any kind) waste many bits due to the (repetitive) use of long URIs and literals. On the other hand, the RDF graph structure hides non-negligible amounts of redundancy by itself. These sources of redundancy are respectively addressed by the Dictionary and Triples of HDT. Both components will be described as conceptual containers for the compression techniques described in the last part of the tutorial.
Dictionary compression is a particular type of compression mainly focused on the symbolic redundancy of URIs and literals. Although there is not much specific work on compressing RDF dictionaries, we will review techniques for compressing (general) string dictionaries which can be used in this semantic scenario. We will pay special attention to the indexing of these compressed dictionaries to provide basic and advanced lookup operations (e.g. substring search). Regarding the RDF triples, we will present different techniques that allow us to save considerable storage space and still provide fast (indexed) lookups over the compressed triples. This is an essential feature for solving basic triple pattern queries, which are, in practice, the core for providing full SPARQL resolution in a compressed triple store. This challenge opens new research opportunities, which will be outlined at the conclusion of this tutorial.
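The division of labour between the Dictionary and the Triples components can be sketched in a few lines: long URIs and literals are replaced by integer IDs, the triples are stored as compact ID tuples, and triple patterns are still resolved over the encoded form. The toy data below is hypothetical, and the flat list stands in for HDT's actual compressed structures:

```python
# Toy RDF triples with long, repetitive URIs and one literal.
triples = [
    ("http://ex.org/alice", "http://ex.org/knows", "http://ex.org/bob"),
    ("http://ex.org/alice", "http://ex.org/name", '"Alice"'),
]

# Dictionary component: map each distinct term to an integer ID.
terms = sorted({t for triple in triples for t in triple})
term_id = {t: i for i, t in enumerate(terms)}
id_term = terms  # reverse mapping: ID -> term, by position

# Triples component: only compact ID triples are stored.
encoded = [tuple(term_id[t] for t in triple) for triple in triples]

def match(s=None, p=None, o=None):
    """Resolve a triple pattern (None = wildcard) over the encoded triples."""
    pattern = tuple(None if t is None else term_id[t] for t in (s, p, o))
    for triple in encoded:
        if all(q is None or q == v for q, v in zip(pattern, triple)):
            yield tuple(id_term[v] for v in triple)

hits = list(match(s="http://ex.org/alice", p="http://ex.org/knows"))
```

Each URI is stored once in the dictionary, however often it recurs in the graph; the triples side is then free to compress the small-integer structure with succinct data structures.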
Lucene/Solr/ElasticSearch in practice (Tutor: Mauro Dragoni)
In this practical session we will give an overview of Lucene, Solr and ElasticSearch, presenting their features and comparing them. A hands-on session with Solr and ElasticSearch will then guide participants in creating an IR system starting from a small document collection.