<node id="671263">
  <nid>671263</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1701119027</created>
  <changed>1701119027</changed>
  <title><![CDATA[PhD Proposal by Arjun Majumdar]]></title>
  <body><![CDATA[<p><strong>Title:</strong> Large-scale Offline Pre-training to Enable Embodied Intelligence</p>

<p>&nbsp;</p>

<p><strong>Arjun Majumdar</strong></p>

<p>Ph.D. Student in Computer Science</p>

<p>School of Interactive Computing</p>

<p>Georgia Institute of Technology</p>

<p>&nbsp;</p>

<p><strong>Date:</strong> November 29th, 2023</p>

<p><strong>Time:</strong> 3:00pm - 5:00pm ET / 12:00pm - 2:00pm PT</p>

<p><strong>Location:</strong> <a href="https://gatech.zoom.us/j/95836141378?pwd=WU9keFBWa0U5ekcvVWRBRXBLcHZQdz09&amp;from=addon">zoom link</a>; Coda C1215 Midtown</p>

<p><strong>Committee:</strong></p>

<p>Dr. Dhruv Batra (Advisor) -- School of Interactive Computing, Georgia Institute of Technology</p>

<p>Dr. Zsolt Kira -- School of Interactive Computing, Georgia Institute of Technology</p>

<p>Dr. James Hays -- School of Interactive Computing, Georgia Institute of Technology</p>

<p>Dr. Jitendra Malik -- University of California, Berkeley</p>

<p>Dr. Vincent Vanhoucke -- Google DeepMind</p>

<p>Dr. Vladlen Koltun -- Apple</p>

<p>&nbsp;</p>

<p><strong>Abstract:</strong></p>

<p>A central goal in Artificial Intelligence is building embodied agents (such as mobile robots) that are generalists -- capable of assisting with a wide variety of tasks (specified in natural language) in any environment or setting. Such agents must understand a vast diversity of concepts in the visual world and be able to ground (or associate) this understanding with language to allow users to describe tasks and goals. How can we develop agents with such an extensive and functional understanding of the world?</p>

<p>&nbsp;</p>

<p>In this thesis, we will argue that offline pre-training of foundation models on web-scale data enables embodied intelligence. In part 1, we present VC-1, a visual foundation model pre-trained (primarily) on video data collected from an egocentric perspective. We systematically demonstrate that such a model substantially benefits from increasing pre-training dataset diversity by introducing CortexBench, an embodied AI (EAI) benchmark curated from a diverse collection of existing EAI tasks (requiring locomotion, navigation, and dexterous and mobile manipulation of objects). In part 2, we first demonstrate that visual grounding learned from internet data (i.e., image-caption pairs from the web) can be transferred to an instruction-following visual navigation agent (VLN-BERT). Then, we present ZSON, a highly scalable approach for learning to visually navigate to objects specified in open-vocabulary, natural language instructions such as &ldquo;find the kitchen sink.&rdquo; The key idea is to leverage a pre-trained visiolinguistic embedding space (from CLIP) to decouple learning to represent semantic goals (such as &ldquo;a kitchen sink&rdquo;) from learning to navigate to semantic goals. Finally, in proposed work, we will study combining vision-and-language models (VLMs) with large language models (LLMs) for the task of embodied question-answering (EQA), which requires an agent to answer open-ended questions about real-world environments.</p>
]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Large-scale Offline Pre-training to Enable Embodied Intelligence]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>Large-scale Offline Pre-training to Enable Embodied Intelligence</p>
]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2023-11-29T15:00:00-05:00]]></value>
      <value2><![CDATA[2023-11-29T17:00:24-05:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Coda C1215 Midtown]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>102851</tid>
        <value><![CDATA[Phd proposal]]></value>
      </item>
      </field_keywords>
  <userdata><![CDATA[]]></userdata>
</node>
