
Sushovan De :: Research

My research interests are in data cleaning in the context of information retrieval, and in probabilistic databases.

Data Cleaning

My thesis work is in data cleaning. Databases are increasingly populated by casual users rather than carefully curated by dedicated employees. Combined with the sheer volume of data being generated, this has made data cleaning very important, yet existing methods scale neither to the current volume of data nor to the variety of errors. My work describes a system that uses the dirty data itself to learn both a model of the clean data and a model of the errors in the data, and then cleans the data in a probabilistically principled manner.

Relevant publications:

Collective Entity Classification

Traditionally, the problem of identifying ambiguous entities in documents has been solved by looking at the immediate neighborhood of the entity within the document itself. In this work (done during an internship at IBM Research Labs, Bangalore, India), we looked at using signals from across documents to classify entities. The principal idea is that similar documents can often provide valuable clues to disambiguate entities that could not otherwise be classified.

Relevant publication:

Probabilistic Databases

Most data today comes with a degree of uncertainty, yet we continue to store it in deterministic databases, since creating, querying, and bookkeeping for probabilistic databases are difficult problems. In this work, I extended the definitions of the various kinds of functional dependencies (Deterministic, Approximate, and Conditional) to the realm of probabilistic databases. I proved certain properties of these dependencies and demonstrated efficient algorithms to compute their confidence in a given probabilistic database, and hence to mine them.
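To make the notion of the confidence of a functional dependency concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the definitions or algorithms from the publication: the function names, the toy zip/city rows, the tuple-independent model, and the choice of "expected confidence over possible worlds" as the probabilistic extension are all assumptions made for this example. The brute-force enumeration is exponential in the number of tuples and is only meant to show the semantics on a tiny table.

```python
from collections import Counter, defaultdict
from itertools import product


def fd_confidence(rows, lhs, rhs):
    """Confidence of the approximate dependency lhs -> rhs in a deterministic table.

    Computed as the largest fraction of rows that can be kept so that the
    dependency holds exactly (keep the majority rhs value in each lhs group).
    """
    if not rows:
        return 1.0  # the dependency holds vacuously on an empty table
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][tuple(row[a] for a in rhs)] += 1
    kept = sum(max(counter.values()) for counter in groups.values())
    return kept / len(rows)


def expected_fd_confidence(prob_rows, lhs, rhs):
    """Expected confidence over all possible worlds (assumed semantics).

    prob_rows is a list of (row, probability) pairs; tuples are assumed to be
    present independently of one another. Enumerates every world explicitly.
    """
    expected = 0.0
    for present in product([False, True], repeat=len(prob_rows)):
        world_prob = 1.0
        world = []
        for (row, p), keep in zip(prob_rows, present):
            world_prob *= p if keep else 1.0 - p
            if keep:
                world.append(row)
        expected += world_prob * fd_confidence(world, lhs, rhs)
    return expected


# Hypothetical toy data, invented for this example.
rows = [
    {"zip": "85281", "city": "Tempe"},
    {"zip": "85281", "city": "Tempe"},
    {"zip": "85281", "city": "Phoenix"},
    {"zip": "85004", "city": "Phoenix"},
]
det_conf = fd_confidence(rows, ["zip"], ["city"])

# One certain tuple and one conflicting tuple that exists with probability 0.5.
prob_rows = [(rows[0], 1.0), (rows[2], 0.5)]
exp_conf = expected_fd_confidence(prob_rows, ["zip"], ["city"])
```

In the deterministic table, three of the four rows can be kept without violating zip -> city, so the confidence is 0.75. In the probabilistic table, the conflicting tuple is absent half the time (confidence 1) and present half the time (confidence 0.5), which also averages to 0.75.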

Relevant publication: