{"671175":{"#nid":"671175","#data":{"type":"event","title":"PhD Defense by Peng Li","body":[{"value":"\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ETitle:\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003E\u0026nbsp;Cleaning and Learning over Dirty Tabular Data\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003EDate: \u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003EFriday, Dec 1, 2023\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ETime: \u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003E15:00 \u2013 17:00 EST\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ELocation:\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003E\u0026nbsp;\u003Ca href=\u0022https:\/\/teams.microsoft.com\/l\/meetup-join\/19%3ameeting_ZjZlNWI0ZWYtZGJhYS00OTBkLWExOWItYzU5NzFhM2NmYmIw%40thread.v2\/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%22ee47036f-d707-4b61-9e8b-cd37b957d372%22%7d\u0022 title=\u0022https:\/\/teams.microsoft.com\/l\/meetup-join\/19%3ameeting_ZjZlNWI0ZWYtZGJhYS00OTBkLWExOWItYzU5NzFhM2NmYmIw%40thread.v2\/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%22ee47036f-d707-4b61-9e8b-cd37b957d372%22%7d\u0022\u003ETeams Link\u003C\/a\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003EPeng Li\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EPh.D. \u003C\/span\u003E\u003C\/span\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003ECandidate\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003Cspan\u003E\u003Cspan\u003E\u0026nbsp;in Computer Science\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003ESchool of Computer Science\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003ECollege of Computing\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EGeorgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ECommittee:\u0026nbsp;\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Xu Chu (advisor) \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Kexin Rong (co-advisor) \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Joy Arulraj \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Shamkant Navathe \u003Cspan\u003E\u2013\u003C\/span\u003E\u0026nbsp;School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Yeye He \u003Cspan\u003E\u2013 Data Management, Exploration and Mining Group, Microsoft Research\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003EAbstract:\u0026nbsp;\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EThe quality of machine learning (ML) applications is only as good as the quality of the data they train on. Unfortunately, real-world data is rarely free of errors, especially for tabular data, which frequently suffers from data issues like missing values, outliers, and inconsistencies. Therefore, data cleaning is widely regarded as an essential step in an ML workflow and an effective way to improve ML performance. However, data cleaning is often a time-consuming task that reportedly takes up to 80% of data scientists\u0027 time. Traditional data cleaning approaches often treat data cleaning as a standalone task independently of its downstream applications, which may not effectively improve ML performance and can sometimes even worsen it. Furthermore, it often leads to unnecessary costs for cleaning errors that have a minor impact on ML performance.\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EThis dissertation jointly considers data cleaning and machine learning, and focuses on developing algorithms and systems for cleaning and learning over dirty tabular data, with the dual objectives of (1) optimizing downstream ML performance and (2) minimizing human efforts. We start with a CleanML empirical study that systematically evaluates the impact of data cleaning on downstream ML performance. We then present CPClean, a cost-effective human-involved data cleaning algorithm for ML that minimizes human cleaning efforts while preserving ML performance. We subsequently demonstrate DiffPrep, an automatic data preprocessing method that can efficiently select high-quality data preprocessing (cleaning) pipelines to maximize downstream ML performance without human involvement. Finally, to obviate the need for humans to manually program table-restructuring transformations, we present Auto-Tables that can automatically transform tables from non-standard formats into a standard format without any human effort. Combining the works in this dissertation, we build an end-to-end system for cleaning and learning over dirty tabular data.\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003ECleaning and Learning over Dirty Tabular Data\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n","format":"limited_html"}],"field_summary_sentence":[{"value":"Cleaning and Learning over Dirty Tabular Data"}],"uid":"27707","created_gmt":"2023-11-20 22:17:44","changed_gmt":"2023-11-20 22:17:44","author":"Tatianna Richardson","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2023-12-01T15:00:00-05:00","event_time_end":"2023-12-01T17:00:00-05:00","event_time_end_last":"2023-12-01T17:00:00-05:00","gmt_time_start":"2023-12-01 20:00:00","gmt_time_end":"2023-12-01 22:00:00","gmt_time_end_last":"2023-12-01 22:00:00","rrule":null,"timezone":"America\/New_York"},"location":"TEAMS","extras":[],"groups":[{"id":"221981","name":"Graduate Studies"}],"categories":[],"keywords":[{"id":"100811","name":"Phd Defense"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1788","name":"Other\/Miscellaneous"}],"invited_audience":[{"id":"78771","name":"Public"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[],"email":[],"slides":[],"orientation":[],"userdata":""}}}