MSc Projects:

Framework for AI System Output Evaluation by Humans

In cooperation with: Brmson project | Student: Not assigned yet

In the design of artificial intelligence systems that are interactive by nature and have free-form output - for example Question Answering, dialog, chat bot or figure synthesis systems - it is difficult to evaluate the performance of the system on a large dataset: many answers cannot be judged correct or incorrect simply by matching them against a predefined template, so a human needs to enter the loop and evaluate them. The task here is surveying the area for existing solutions and building a framework for human evaluation of results - capable of both interactive evaluation (users ask questions and evaluate whether the output is correct) and batch evaluation (users evaluate sets of pre-generated answers to past questions), and supporting aggregation, analysis, memoization and export of results. The framework should be reasonably generic, but we will apply it to the Question Answering domain, on the "brmson" system developed in our group.
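
A minimal sketch of what the judgement store could look like (in Python; the class and method names are illustrative, not an existing design) - the key point is memoization, so an answer that reappears in a later batch or interactive session never has to be re-judged:

    import hashlib, json

    class EvaluationStore:
        """Memoized human judgements, keyed by (question, answer)."""

        def __init__(self):
            self.judgements = {}          # key -> list of 0/1 votes

        def _key(self, question, answer):
            return hashlib.sha1((question + "\x00" + answer).encode()).hexdigest()

        def record(self, question, answer, correct):
            self.judgements.setdefault(self._key(question, answer), []).append(int(correct))

        def lookup(self, question, answer):
            votes = self.judgements.get(self._key(question, answer))
            if votes is None:
                return None               # unknown - a human must judge it
            return sum(votes) / len(votes)  # aggregated score across annotators

        def pending(self, batch):
            """Pairs from a pre-generated batch that still need a judgement."""
            return [(q, a) for q, a in batch if self.lookup(q, a) is None]

        def export(self, path):
            with open(path, "w") as f:
                json.dump(self.judgements, f)

Interactive mode then calls record() as users judge live answers, while batch mode iterates pending() over pre-generated answer sets.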

Smart Web Interface for Question Answering System

In cooperation with: Brmson project | Student: Not assigned yet

In our group, we are building the "brmson" system for question answering - the user enters a question like "What is the name of the southwesternmost tip of England?", "Who received the Nobel Peace Prize in 2014?", "What hair color did Thomas Jefferson have before grey?", "What is the distance of Earth from the Sun?" or "How to change a flat tire?"; the system generates a hopefully correct reply (or several possible replies). It can be thought of as a simpler version of Google where the input is not just keywords but a fully phrased question, and the output is not a list of search results, but the actual information. The task here is building a web interface for such a system. However, this is not a trivial matter of an input box and a simple results page; the usefulness of the system would radically improve if we made it *more* like (say) Google, letting the user explore potential replies and letting the system justify them by the evidence it found - links to source material, answer type evaluation, etc.; all this should be highly interactive and user friendly. The system may take some time to generate the various answers, so the view also needs to be fluid and displayed incrementally.
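
One way to get the incremental display (a Python sketch; the pipeline.candidates() interface is hypothetical) is to stream answer candidates to the browser as Server-Sent Events the moment the back end produces them:

    import json

    def stream_answers(pipeline, question):
        """Yield SSE frames as answer candidates arrive from the QA back end."""
        for rank, (answer, confidence, evidence) in enumerate(
                pipeline.candidates(question), start=1):
            payload = {"rank": rank, "answer": answer,
                       "confidence": confidence, "evidence": evidence}
            yield "event: answer\ndata: %s\n\n" % json.dumps(payload)
        yield "event: done\ndata: {}\n\n"   # tell the client to stop waiting

The browser can then render each candidate (with its evidence links) as soon as it appears, instead of blocking on the full answer list.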

Numerical Entity Extraction from Text

In cooperation with: Brmson project | Student: Not assigned yet

In our group, we are building the "brmson" system for question answering. The answers to many questions like "What is the distance of Earth from the Sun?", "What is the maximum height of a Japanese train?", "How wide are train rails?" or "What is the critical mass of plutonium?" are numerical quantities. Sometimes they are stored in a structured database that we can simply query, but all too often they are embedded in free text such as Wikipedia articles. The task here is building an NLP system that scans massive amounts of unstructured text (e.g. the English Wikipedia) and extracts numerical entities that specify such relations; from a sentence like "Shinkansen network employs standard gauge and maximum width of 3.40 m (11 ft 2 in) and maximum height of 4.50 m (14 ft 9 in)", the system should generate the Shinkansen.width and Shinkansen.height values. Often (though not always) the values will be accompanied by units that can help infer the type of relation (but they can come from different measurement systems). However, notice that while kilograms will typically represent mass, meters can represent any of distance, height, width and more; the system will need to use other cues to distinguish these. The goal is not building an extensive set of hard-coded heuristics, but rather a flexible machine learning system that infers the extraction rules automatically; we will help you to focus on the right state-of-the-art algorithms and get up to speed on them. This is a hard problem, but we do not have to solve it fully - good progress is more than enough for an excellent thesis.
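
For illustration, a crude rule-based baseline (Python; the unit table and cue matching are deliberately simplistic stand-ins for what the learned model should infer) could pair each quantity with the nearest relation cue on its left:

    import re

    # Toy unit table - note that "m" alone cannot disambiguate the relation.
    UNIT_HINTS = {"m": {"height", "width", "distance", "length"},
                  "km": {"distance"}, "kg": {"mass"}}

    QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(km|kg|m)\b")
    CUES = re.compile(r"\b(height|width|length|mass|distance)\b")

    def extract(sentence, entity):
        """Emit (entity, relation, value, unit) facts from one sentence."""
        facts = []
        for m in QUANTITY.finditer(sentence):
            cues = CUES.findall(sentence[:m.start()])
            if cues and cues[-1] in UNIT_HINTS.get(m.group(2), set()):
                facts.append((entity, cues[-1], float(m.group(1)), m.group(2)))
        return facts

    print(extract("Shinkansen has a maximum width of 3.40 m and a maximum "
                  "height of 4.50 m.", "Shinkansen"))
    # -> [('Shinkansen', 'width', 3.4, 'm'), ('Shinkansen', 'height', 4.5, 'm')]

The thesis is about replacing exactly this kind of brittle pattern with rules the system learns itself.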

Detection of Anomalies in User Behaviour from Search Engine Query Logs

In cooperation with: Seznam.cz | Student: Tomáš Vyskočil | Project tracking

In order to improve the quality of search and the overall user experience, search engines try to leverage the data and statistics collected from users' interactions with the system. These statistics can help to determine the relevance of web pages to given queries, or to determine what queries will be suggested in query auto-completion. But such search engine systems are jeopardized by the malicious intents of some users, who try to promote their own web pages by artificially skewing these statistics in their own favor (by re-issuing the same query many times or clicking on their results repeatedly). Programs (so-called robots) created for this purpose can issue thousands of actions, which makes this a real problem. The goal of this project is to design models that can identify users (and robots) with "suspicious" (malicious) behaviour from the query logs and click logs. These models could then be used to mitigate the exploitation of such search engine systems.
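
As a flavour of the simplest possible detector (Python sketch; the feature choice and the z-score threshold are illustrative only), one can compute per-user statistics from the logs and flag outliers:

    from collections import Counter
    from math import log2
    from statistics import mean, stdev

    def repetition_rate(actions):
        """Share of a user's actions spent on their single most repeated query."""
        queries = Counter(q for q, _click in actions)
        return queries.most_common(1)[0][1] / sum(queries.values())

    def click_entropy(actions):
        """Low entropy = clicks concentrated on few URLs (promotion pattern)."""
        clicks = Counter(u for _q, u in actions if u)
        total = sum(clicks.values()) or 1
        return -sum(c / total * log2(c / total) for c in clicks.values())

    def suspicious(users, z=3.0):
        """Flag users whose repetition rate is z std. deviations above the mean."""
        rates = {uid: repetition_rate(a) for uid, a in users.items()}
        mu, sigma = mean(rates.values()), stdev(rates.values())
        return [uid for uid, r in rates.items() if r > mu + z * sigma]

Real robots, of course, adapt; the project is about models that are harder to game than a single thresholded feature.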

Phishing Email Detection in Czech Language

In cooperation with: Seznam.cz | Student: Vít Listík | Project tracking

Clickstream analysis

In cooperation with: Avast! | Student: Dušan Jenčík | Project tracking

The goal of this project is the categorization of users from clickstreams. A clickstream contains the URLs visited by a user while browsing the web. The task is to determine who the user is - woman, man or child, approximately how old, etc.
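
A natural baseline (Python sketch, assuming scikit-learn is available; the URLs and labels below are made up) is to turn each clickstream into bag-of-hostnames features and train any off-the-shelf classifier per attribute:

    from urllib.parse import urlparse
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def hostname_counts(urls):
        """One user's clickstream -> {hostname: visit count} features."""
        feats = {}
        for url in urls:
            host = urlparse(url).hostname or "unknown"
            feats[host] = feats.get(host, 0) + 1
        return feats

    streams = [["http://toys.example.com/lego", "http://cartoons.example.com/"],
               ["http://finance.example.com/", "http://news.example.com/"]]
    labels = ["child", "adult"]                    # illustrative ground truth

    X = DictVectorizer().fit_transform(hostname_counts(s) for s in streams)
    model = LogisticRegression().fit(X, labels)

Age estimation would be the same pipeline with a regression model instead of a classifier.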

Document abstract synthesis

In cooperation with: Seznam.cz | Student: Jonáš Amrich | Project tracking

Information retrieval engines need an inverted index of documents to answer search queries. The inverted index is generated by robots crawling the web and downloading the content of the pages. Some pages do not allow the robots to download their content, some consist only of graphics, and some are rendered by JavaScript. For these pages we need to generate the title or snippets automatically.
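
One well-known fallback signal for such pages is the anchor text of links pointing at them; a toy sketch (Python; the majority-vote rule is illustrative) of title synthesis from inbound anchors:

    from collections import Counter

    def synthesize_title(inbound_anchors, max_words=10):
        """Pick the most common inbound anchor text as the page title."""
        votes = Counter(a.strip().lower() for a in inbound_anchors if a.strip())
        best, _count = votes.most_common(1)[0]
        return " ".join(best.split()[:max_words]).capitalize()

    print(synthesize_title(["Flat tire HOWTO", "flat tire howto",
                            "changing a wheel"]))   # -> 'Flat tire howto'

Snippet generation is harder, since it has to summarize content the crawler never saw, e.g. from the texts surrounding the inbound links.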

Android tablet in-car application

In cooperation with: | Student: Michael Bláha | Project tracking

Study the use of smartphones and tablets with respect to the car environment. Design an Android application for in-car use. The user interface needs to minimize the secondary-task cognitive load on the driver. Minimize text entry for item selection using information from favorites, history, context and other sources such as the calendar, contacts etc. Demonstrate the selected UI actions by implementing user interfaces for:

  • The phone, allowing call reception and call placement by selecting from a list of contacts.
  • The media player, selecting the genre, artist, album, song etc.
  • Entry of a destination into the navigation, using Google Maps navigation.

Implement a system for monitoring and accumulating the users' actions on the server side, using one of the standard analytics solutions (Google Analytics, Flurry). The primary platform is the Android tablet. Provide basic code testing.

Android smartphone in-car application

In cooperation with: | Student: Lukáš Hrubý | Project tracking

Study the use of smartphones and tablets with respect to the car environment. Design an Android application for in-car use. The user interface needs to minimize the secondary-task cognitive load on the driver. Minimize text entry for item selection using information from favorites, history, context and other sources such as the calendar, contacts etc. Demonstrate the selected UI actions by implementing user interfaces for:

  • The phone, allowing call reception and call placement by selecting from a list of contacts.
  • The media player, selecting the genre, artist, album, song etc.
  • Entry of a destination into the navigation, using Google Maps navigation.

Implement a system for monitoring and accumulating the users' actions on the server side, using one of the standard analytics solutions (Google Analytics, Flurry). The primary platform is the Android smartphone. Provide basic code testing.
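
Both in-car projects share the server-side monitoring requirement; where a third-party solution (Google Analytics, Flurry) is not enough, a minimal custom collector could look like this (Python sketch; the endpoint, port and event schema are illustrative):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ActionLogger(BaseHTTPRequestHandler):
        """Append posted UI-action events (JSON) to a flat log file."""
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            event = json.loads(self.rfile.read(length))
            # e.g. {"user": "u42", "action": "call_placed", "ts": 1418822400}
            with open("actions.log", "a") as log:
                log.write(json.dumps(event) + "\n")
            self.send_response(204)     # no content; the app fires and forgets
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), ActionLogger).serve_forever()

The Android side then just POSTs one JSON event per completed UI action (call placed, song selected, destination entered).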


PhD Projects:

Load Forecasting for Cloud Computing

Student: Tomáš Vondra

Cloud computing is the latest advance in data center management. In its IaaS (Infrastructure as a Service) form, it allows for rapid provisioning of virtualized server, storage and network resources with minimal user interaction. Using add-on configuration and deployment tools, or the next layer of the cloud, PaaS (Platform as a Service), applications can be deployed to these server instances. The automation possibilities given by cloud APIs (Application Programming Interfaces) offer a great deal of ground for research on how to use the resources optimally and improve user satisfaction.
The user will want to reduce costs by using automatic scaling, which is available either from the provider or as a third-party service. These services offer the possibility of running a small number of instances persistently and boosting the computing power when the offered load demands it. The available scaling services are mostly reactive - they react to the measured level of resource usage in the virtual machines. If proactive scaling methods employing load forecasting were used, the autoscaling service could avoid overload situations and thus raise the GoS (Grade of Service).
Moreover, the capacity of a private cloud is not (even seemingly) infinite and is a known quantity. If it is not sufficient (all of the time or perhaps only at peak hours), the user can turn to a hybrid cloud and cloudbursting.
It is also possible to save money in an underutilized private cloud by turning off unused computers.
The cloud, private or public, besides being a platform for web applications, can also take on the role previously occupied by the grid, that is, batch computing. Data mining tasks can be used to fill unused computing resources, provided the capacity is forecast correctly so as not to violate the SLAs of the primary services running on the cloud.
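
To make the idea concrete, here is a minimal proactive-scaling sketch (Python; Holt's double exponential smoothing, with illustrative load numbers and threshold) - the autoscaler boots instances ahead of a predicted peak instead of reacting after the overload:

    def holt_forecast(series, alpha=0.5, beta=0.3, horizon=3):
        """Smooth level + trend, then extrapolate `horizon` steps ahead."""
        level, trend = series[0], series[1] - series[0]
        for x in series[1:]:
            prev = level
            level = alpha * x + (1 - alpha) * (level + trend)
            trend = beta * (level - prev) + (1 - beta) * trend
        return [level + (h + 1) * trend for h in range(horizon)]

    cpu_load = [35, 40, 48, 55, 61, 70]     # cluster CPU %, 5-minute samples
    forecast = holt_forecast(cpu_load)
    if max(forecast) > 80:                  # illustrative scale-out threshold
        print("scale out ahead of predicted peak:", forecast)

Seasonal models (e.g. Holt-Winters with a daily period) fit web traffic better; the sketch only shows where the forecast plugs into the scaling decision.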

Boosting for Learning to Rank with Query Log Data

Student: Tomáš Tunys | Scientific Report

The performance of almost every machine learning algorithm, including learning to rank algorithms, heavily depends on the size and quality of the training dataset. The process of gathering large datasets is very expensive, since the relevance of the documents to the queries is judged by human assessors. It is also subjective, and disagreements among the judgements inevitably emerge. Some studies show that a certain level of noise in relevance judgements has only a small effect on the final evaluation [Harter, 1996; Voorhees, 1998; Bailey et al., 2008]; however, I believe that better exploitation of the available data will help us train more accurate models, leading to better ranking.

Fortunately, current search engines save query logs. These contain massive amounts of data about users' behaviour and their needs. This information can be exploited to mitigate the data quantity and quality issues stated above. Specifically useful sources of information are clickthrough, dwell time and query reformulation data, all of which can be used as a noisy substitute for relative information about the relevance of documents. For example, clickthrough rate data has been shown to improve the performance of RankSVM [Joachims, 2002] and can be used to improve the performance of any ranking algorithm by correcting the relevance judgements in the training data [Xu et al., 2010]. Moreover, query chains, i.e. sequences of query reformulations gathered in a session during which the user was trying to find the desired information, contain even more information that can be used to improve relevance judgements, as shown by [Radlinski and Joachims, 2005].

To our knowledge, none of the current state-of-the-art learning to rank algorithms, such as LambdaMART [Burges, 2010] or YetiRank [Gulin et al., 2011], which were part of the winning solutions to the Yahoo! Learning to Rank Challenge [Chapelle et al., 2011], has been trained on document, clickthrough and query reformulation data at once. I think there lies a hidden potential to boost these well-performing algorithms even further, and I have therefore decided to study LambdaMART/YetiRank and related algorithms and try to incorporate query log data into their training.
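
As an example of the kind of signal involved, Joachims' classic "click > skip above" heuristic turns one result page plus its clicks into pairwise training preferences (Python sketch):

    def preferences_from_clicks(ranking, clicked):
        """A clicked document is preferred over every unclicked one above it."""
        prefs = []
        for pos, doc in enumerate(ranking):
            if doc in clicked:
                prefs.extend((doc, skipped) for skipped in ranking[:pos]
                             if skipped not in clicked)
        return prefs

    # User clicked results 1 and 3; result 2 was seen and skipped.
    print(preferences_from_clicks(["doc_a", "doc_b", "doc_c", "doc_d"],
                                  {"doc_a", "doc_c"}))
    # -> [('doc_c', 'doc_b')]

Such pairs (and their query-chain generalizations) are exactly the extra training signal this project aims to feed into LambdaMART/YetiRank-style learners.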

Brmson: Towards Crystal Ball Level Question Answering!

Student: Petr Baudiš

One of the ultimate applications of information retrieval, information extraction and related scientific topics is the task of Question Answering,
where the computer receives a free-form question from the user (like "What is the name of the southwesternmost tip of England?",
"Who received the Nobel Peace Prize in 2014?", "What hair color did Thomas Jefferson have before grey?",
"What is the distance of Earth from the Sun?" or "How to change a flat tire?")
and generates a hopefully correct
reply to the question. It can be thought of as a simpler version of Google where the input is not just keywords but a fully phrased
question, and the output is not a list of search results, but the actual information.

Question Answering is an active area of research - it can be performed purely over structured databases, by wrapping a general-purpose search engine
(like Google) or, in the most flexible case, by performing information retrieval itself over various unstructured sources (e.g. freetext Wikipedia articles).
State-of-the-art systems include, for example, IBM Watson "DeepQA", which won the Jeopardy! match against human champions, and personal assistants like Siri;
even Google itself is starting to include specific answers in its search results. Most of these systems deal with open domain ("trivia") questions,
which is what we use for benchmarking as well, but our system also lends itself very well to extensions to specific domains.

Many systems are merely collections of case-by-case heuristics dealing with specific kinds of questions or matching specific answer types. Our goal is
building a general system that requires only a minimum of hand-crafted heuristics and instead leverages machine learning algorithms to weigh evidence and
choose the right answers. Our baseline framework, brmson YodaQA
(currently rapidly approaching version 1.0), is fully open source and seeks to become the best performing open source
QA system. Our basic architecture closely matches a simplified IBM Watson DeepQA and is implemented mainly in Java, using the UIMA framework.

The focus of this project, aside from developing the baseline framework open for other researchers to build upon,
lies in investigating better answer type representations than the LAT (a single, plain English word describing the answer type), composite answer
representations (such as lists of items or processes), and reasoning on implicitly represented data spread over unstructured free texts.
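
For orientation, the DeepQA-style pipeline that YodaQA follows can be summarized in a few stages (Python pseudocode-level sketch; the search and score arguments stand in for the real retrieval and ML-scoring modules, and passage.entities is a hypothetical interface):

    def answer(question, search, score):
        """Question analysis -> retrieval -> candidates -> scored ranking."""
        # 1. Question analysis: keywords plus an expected answer type (LAT).
        keywords = [w for w in question.lower().split() if len(w) > 3]
        # 2. Primary search: retrieve passages from (un)structured sources.
        passages = search(keywords)
        # 3. Candidate generation: entities in passages become candidates.
        candidates = {e for p in passages for e in p.entities}
        # 4. Evidence scoring: an ML model weighs per-candidate features.
        return sorted(candidates, key=score, reverse=True)

The research questions above then map mainly onto stage 1 (richer answer type representations) and stage 4 (reasoning over the gathered evidence).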

Information Extraction from REST documentation

Student: Tomáš Gogár

During the last decade, various studies have reported an increasing usage of Web Services. The concept of Web Services has become popular because of its flexibility and ease of use. Nowadays the majority of public service providers (such as Google, Yahoo, Facebook, etc.) use the REST architectural style based on the HTTP protocol for their APIs.

With the enormous number of available REST services, a new issue has emerged - potential users of the services need an efficient and standard method for finding suitable services and getting sufficient information on how to use them. Unfortunately, there is no standard and widely used way of documenting and discovering REST APIs - usually every service provider documents its API on a custom website. There are some initiatives that try to manually populate a registry of accessible services, such as ProgrammableWeb (www.programmableweb.com). The process of manual data acquisition brings two main issues: a large part of the available services is still not listed, and information about those that are included rapidly becomes obsolete. Another possible way of exploring useful services is using generic search engines (such as Google), but the results from these engines are often not accurate enough and do not provide information in a machine readable format.

In this project we would like to apply machine learning algorithms to automatically extract information from documentation web pages visited by our crawler. This task, where structured information is extracted from unstructured or semi-structured texts, is often referred to as Information Extraction. We hope that using this approach we will be able to identify a large number of existing APIs and provide up-to-date information for developers.
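
As a starting-point illustration (Python; the regex is a deliberately naive baseline that the learned extractor should subsume), many documentation pages spell endpoints as "VERB /path", which already yields a crude API inventory:

    import re
    from urllib.request import urlopen

    ENDPOINT = re.compile(r"\b(GET|POST|PUT|DELETE|PATCH)\s+(/[\w/{}.-]*)")

    def extract_endpoints(url):
        """Scrape 'VERB /path' patterns from one documentation page."""
        html = urlopen(url).read().decode("utf-8", errors="replace")
        return sorted(set(ENDPOINT.findall(html)))

    # e.g. extract_endpoints("http://example.com/api/docs")  # illustrative URL

The machine learning part of the project is precisely about replacing such brittle patterns with extractors trained on annotated documentation pages.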