Sushovan De :: Research

My research interests are in data cleaning in the context of information retrieval, and in probabilistic databases.

Data Cleaning

Try our system out: Download BayesWipe - a fully automatic database cleaner.

My thesis work is in data cleaning. Databases are increasingly generated by casual users instead of being carefully curated by dedicated employees. Combined with the sheer volume of data being generated, this has made data cleaning very important, yet existing methods scale neither to the current volume of data nor to the variety of errors. My work details a system that uses the dirty data itself to learn a model of the clean data as well as a model of the errors in the data, and then cleans the data in a probabilistically principled manner.
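As a rough illustration of the idea (not the BayesWipe implementation), the sketch below picks, for an observed value, the candidate that maximizes prior times likelihood. The prior is estimated from value frequencies in the dirty column itself. The error model, the `error_rate` parameter, the function name `clean_value`, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def clean_value(observed, column, error_rate=0.1):
    """Return the candidate c in `column` maximizing P(c) * P(observed | c).

    P(c) is the relative frequency of c in the (dirty) column.
    P(observed | c) is a toy error model (an assumption of this sketch):
    a value is kept intact with probability 1 - error_rate, and otherwise
    corrupted into something scored by string similarity.
    """
    counts = Counter(column)
    total = sum(counts.values())

    def likelihood(candidate):
        if candidate == observed:
            return 1.0 - error_rate
        similarity = SequenceMatcher(None, observed, candidate).ratio()
        return error_rate * similarity

    return max(counts, key=lambda c: (counts[c] / total) * likelihood(c))


# Hypothetical dirty column: one misspelling among many correct values.
cities = ["Phoenix"] * 20 + ["Tempe"] * 5 + ["Pheonix"]
print(clean_value("Pheonix", cities))
print(clean_value("Tempe", cities))
```

With this data the rare misspelling is mapped to the frequent, similar value, while a frequent value is left as it is. A real system would learn both models jointly and operate on whole tuples rather than single values.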

Relevant publications:

Sushovan De, Yuheng Hu, Venkata Vamsikrishna Meduri, Yi Chen, Subbarao Kambhampati (2016). BayesWipe: A Scalable Probabilistic Framework for Improving Data Quality. ACM JDIQ Special Issue on Web Data Quality. [PDF] [BiBTeX]
Sushovan De, Yuheng Hu, Yi Chen, Subbarao Kambhampati (2014). BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured BigData. IEEE BigData. [PDF] [BiBTeX]
Sushovan De, Yuheng Hu, Yi Chen, Subbarao Kambhampati (2014). BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured Data. Big Uncertain Data (BUDA). [PDF] [BiBTeX]
Sushovan De (2014). Unsupervised Bayesian Data Cleaning Techniques For Structured Data. Ph.D. dissertation. [PDF] [BiBTeX]
Rohit Raghunathan, Sushovan De, Subbarao Kambhampati (2013). Bayes Networks for Supporting Query Processing Over Incomplete Autonomous Databases. Journal of Intelligent Information Systems. [PDF] [BiBTeX]
Sushovan De (2013). Unsupervised Bayesian Data Cleaning Techniques For Structured Data. Ph.D. prospectus document. [PDF] [BiBTeX]
Yuheng Hu, Sushovan De, Yi Chen and Subbarao Kambhampati (2012). Bayesian Data Cleaning for Web Data. arXiv preprint. [PDF] [BiBTeX]

Collective Entity Classification

Traditionally, the problem of identifying ambiguous entities in documents has been solved by looking at the immediate neighborhood of the entity within the document itself. In this work (done during an internship at IBM Research Labs, Bangalore, India), we looked at using signals from across documents to classify entities. The principal idea is that similar documents can often provide valuable clues to disambiguate entities that could not otherwise be classified.

Relevant publication:

Amit K Singh, Karthik Visweswariah, Sushovan De (2012). Annotating Entities Using Cross-Document Signals. US Patent application US20130325849 A1.

Planning and Crowdsourcing

The best hole-in-the-wall restaurants in New York are known to New Yorkers, not to Yelp. Yet, when making a travel plan, sources like these are easily overlooked because it is difficult to gather, organize, and tailor this information into a useful travel plan. We experimented with having an automated planner take travel requirements and constraints as input, and ask humans for recommendations. We then made the planner automatically check constraints and guide the humans towards making a better, more complete plan.

Relevant publications:

Lydia Manikonda, Tathagata Chakraborty, Sushovan De, Kartik Talamadupula, Subbarao Kambhampati (2014). AI-MIX: How a Planner Can Help Guide Humans Towards a Better Crowdsourced Plan. Innovative Applications of Artificial Intelligence (IAAI 2014). [PDF] [BiBTeX]
Won the 2014 ICAPS System Demonstration People's Choice Award.
Lydia Manikonda, Tathagata Chakraborty, Sushovan De, Kartik Talamadupula, Subbarao Kambhampati (2014). AI-MIX: Using Automated Planning to Steer Human Workers Towards Better Crowdsourced Plans. Scheduling and Planning Applications woRKshop (SPARK 2014). [PDF] [BiBTeX]

Social Network Analytics

The avenues through which we express ourselves have shifted dramatically from the physical world to online platforms, which lack the expressive power of direct human conversation. As a result, it has become harder to detect mental health issues like depression and social anxiety. In this work, we study one particular social network, reddit, for characteristics that indicate mental health issues. We also investigate the extent to which the promise of online anonymity makes people more likely to share their true feelings.

Relevant publication:

Munmun De Choudhury, Sushovan De (2014). Mental Health Discourse on reddit: Self-disclosure, Social Support, and Anonymity. International AAAI Conference On Weblogs And Social Media (ICWSM 2014). [PDF] [BiBTeX]

Probabilistic Databases

Most data today comes with a degree of uncertainty, yet we continue to store it in deterministic databases, since creating, querying, and maintaining probabilistic databases is difficult. In this work, I extended the definitions of the various kinds of functional dependencies (Deterministic, Approximate, and Conditional) to the realm of probabilistic databases. I proved certain properties of these dependencies and demonstrated efficient algorithms to compute their confidence in a given probabilistic database, and hence to mine them.
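To make the notion of a dependency's confidence concrete, here is a minimal sketch under assumptions that are not taken from the paper: tuples are present independently with a given probability, and the confidence of a deterministic dependency is the probability that it holds in a randomly drawn possible world. The sketch enumerates every possible world, so it is exponential in the number of tuples and stands in for, rather than reproduces, the efficient algorithms mentioned above. The names `fd_holds` and `fd_confidence` and the sample relation are hypothetical.

```python
from itertools import product


def fd_holds(rows, lhs, rhs):
    """True if every pair of rows agreeing on `lhs` also agrees on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def fd_confidence(tuples, lhs, rhs):
    """Probability that lhs -> rhs holds in a random possible world.

    `tuples` is a list of (row, probability) pairs; each row is assumed
    to be present independently of the others (an assumption of this
    sketch). Brute force: sums the probability of every world in which
    the dependency holds.
    """
    confidence = 0.0
    for present in product([True, False], repeat=len(tuples)):
        world_prob = 1.0
        world = []
        for (row, p), keep in zip(tuples, present):
            world_prob *= p if keep else 1.0 - p
            if keep:
                world.append(row)
        if fd_holds(world, lhs, rhs):
            confidence += world_prob
    return confidence


# Hypothetical relation: two tuples agree on the city, one conflicts.
relation = [
    ({"zip": "85281", "city": "Tempe"}, 0.9),
    ({"zip": "85281", "city": "Tempe"}, 0.8),
    ({"zip": "85281", "city": "Mesa"}, 0.5),
]
print(fd_confidence(relation, ["zip"], ["city"]))
```

Here the dependency zip -> city is violated only in worlds that contain the "Mesa" tuple together with at least one "Tempe" tuple, so the confidence is 1 - 0.5 * (1 - 0.1 * 0.2) = 0.51.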

Relevant publication:

Sushovan De, Subbarao Kambhampati (2010). Defining and Mining Functional Dependencies in Probabilistic Databases. arXiv preprint. [arXiv] [BiBTeX]