Significant advances in AI technologies over recent decades have led to a slew of AI-enabled robots and smart devices across a variety of applications, including self-driving cars, digital personal assistants, and autonomous drones. These agents function reasonably well in well-defined, structured environments, but they generally exhibit limited adaptability and robustness when interacting with unfamiliar settings and scenarios. Despite recent advances in AI, it is widely acknowledged that the level of intelligence embodied in these agents is not sufficient to tackle complex challenges arising from previously unencountered events or unscripted interactions with humans. iMaple's research focuses on endowing these agents with improved intelligence for understanding their environments and perceiving human needs. To achieve these goals, iMaple explores ways to enable agents to intelligently perceive and learn from their surroundings and to interact appropriately and usefully with humans. We consider multimodal approaches, including computer vision, audio recognition, acoustic scene classification, and other modality-based methods, for developing novel concepts and innovations in machine learning.