{"667564":{"#nid":"667564","#data":{"type":"event","title":"PhD Proposal by Peng Li","body":[{"value":"\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ETitle:\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003E Cleaning and Learning over Dirty Tabular Data\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003EDate: \u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003EWednesday, May 3, 2023\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ETime: \u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003E14:00 \u2013 16:00 EDT\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ELocation:\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003Cspan\u003E\u003Cspan\u003E\u0026nbsp;\u003Ca href=\u0022https:\/\/teams.microsoft.com\/l\/meetup-join\/19%3ameeting_MTFlNGFmYWUtNDViYi00NDJmLWFlYTAtNjY0NmQxNDE0Zjhl%40thread.v2\/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%22ee47036f-d707-4b61-9e8b-cd37b957d372%22%7d\u0022 title=\u0022https:\/\/teams.microsoft.com\/l\/meetup-join\/19%3ameeting_MTFlNGFmYWUtNDViYi00NDJmLWFlYTAtNjY0NmQxNDE0Zjhl%40thread.v2\/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%22ee47036f-d707-4b61-9e8b-cd37b957d372%22%7d\u0022\u003ETeams 
Link\u003C\/a\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003EPeng Li\u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EPh.D. Student in Computer Science\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003ESchool of Computer Science\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003ECollege of Computing\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EGeorgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003ECommittee: \u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Xu Chu (Advisor) \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. 
Joy Arulraj \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Kexin Rong \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EDr. Shamkant Navathe \u2013 School of Computer Science, Georgia Institute of Technology\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cstrong\u003E\u003Cspan\u003E\u003Cspan\u003EAbstract: \u003C\/span\u003E\u003C\/span\u003E\u003C\/strong\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EThe quality of machine learning (ML) applications is only as good as the quality of the data they train on. Because of noisy inputs from manual data curation and errors from automatic data collection programs, real-world training data is seldom free of errors. For this reason, data cleaning is widely regarded as an essential step in an ML workflow and an effective way to improve model quality. However, data cleaning is still a time-consuming task that heavily depends on human experts. Moreover, most existing data cleaning works treat data cleaning as a standalone task that is independent of its downstream applications. This separation of data cleaning from ML applications is problematic because it may not improve ML performance and can incur unnecessary data cleaning costs. 
\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003E\u003Cspan\u003EIn this proposed thesis, I aim to design theories, algorithms, and systems for cleaning and learning over dirty tabular data to (1) maximize the downstream ML model performance; and (2) minimize human cleaning costs. To achieve this goal, my research roadmap consists of four stages. First, I propose to establish the feasibility of this research by building a CleanML benchmark that empirically investigates the impact of data cleaning on downstream ML model performance. Second, I propose to design theories and algorithms for human-involved data cleaning, where the goal is to both maximize the ML model performance and minimize human cleaning costs. Third, I propose to design algorithms and systems to automate data cleaning for ML for structural tabular data, where the goal is to automatically select the data cleaning algorithm that maximizes the ML model performance without any human effort. 
Finally, I propose to design algorithms and systems to automatically transform non-structural tables into structural tables, which will extend the application of this work to all tabular data.\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/span\u003E\u003C\/p\u003E\r\n","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cp\u003Esee abstract\u003C\/p\u003E\r\n","format":"limited_html"}],"field_summary_sentence":[{"value":"Cleaning and Learning over Dirty Tabular Data"}],"uid":"27707","created_gmt":"2023-04-28 19:47:37","changed_gmt":"2023-04-28 19:47:37","author":"Tatianna Richardson","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2023-05-03T14:00:00-04:00","event_time_end":"2023-05-03T16:00:57-04:00","event_time_end_last":"2023-05-03T16:00:57-04:00","gmt_time_start":"2023-05-03 18:00:00","gmt_time_end":"2023-05-03 20:00:57","gmt_time_end_last":"2023-05-03 20:00:57","rrule":null,"timezone":"America\/New_York"},"location":"REMOTE","extras":[],"groups":[{"id":"221981","name":"Graduate Studies"}],"categories":[],"keywords":[{"id":"102851","name":"Phd proposal"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1788","name":"Other\/Miscellaneous"}],"invited_audience":[{"id":"78771","name":"Public"},{"id":"174045","name":"Graduate students"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[],"email":[],"slides":[],"orientation":[],"userdata":""}}}