What's left w/ foundation models? We found that they still can't ground modular concepts across domains.
We present Logic-Enhanced FMs: 🤝 FMs & neuro-symbolic concept learners. We learn abstractions of concepts like “left” across domains & do domain-independent reasoning w/ LLMs.
Excited to share that I'll be starting my PhD in computer science at
@Stanford
this fall as a Knight Hennessy scholar & NSF fellow! Beyond grateful to all the wonderful people who have supported me on this journey.
Does GPT-4V understand geometric concepts as humans do?
We revisit Geoclidean, and ask GPT-4V to learn geometric concepts from few examples. We see that GPT-4V's performance in classifying geometric abstractions differs significantly from that of humans.
Can we give LLMs access to a visual scratchpad with diagrammatic abstractions and improve reasoning on text-based tasks?
Come to the (spoiler) I Can't Believe It's Not Better workshop at
@NeurIPSConf
on Saturday to find out!
How can we build a modular and compositional system that understands 3D scenes? Excited to introduce our
@CVPR
paper – NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations, w
@maojiayuan
and
@jiajunwu_cs
. Check out our poster next week at Tue-AM-249.
Check out our paper at
@CVPR
tomorrow (day 1, session 1)! Presenting DARCNN: Domain Adaptive Region-based Convolutional Neural Network for Unsupervised Instance Segmentation in Biomedical Images. w Wah Chiu and
@syeung10
Will be presenting our paper
@NeurIPSConf
this Wednesday 12:30-2:00AM PT (poster session 3) – Capturing Implicit Hierarchical Structure in 3D Biomedical Images with Self-Supervised Hyperbolic Representations! w
@jeffhygu
, Gong Her Wu, Wah Chiu, and
@syeung10
Will be presenting our paper
@NeurIPSConf
D&B in person this Thursday 11am-1pm CST (poster session 5) – Geoclidean: Few-Shot Generalization in Euclidean Geometry, w
@jiajunwu_cs
and Noah Goodman
Excited to talk about Logic-Enhanced Foundation Models (LEFT)
@NeurIPSConf
next week! Come chat with us on Tuesday morning at
#203
.
Try out our Colab notebook to train your own LEFT and learn concepts on a new dataset in ~100 lines of code.
Colab:
So excited to present our paper – Open Data Standard and Analysis Framework: Towards Response Equity in Local Governments –
@ACMEAAMO
tomorrow (11:30PT session)! This work has been a long time coming w
@CityofSanJose
&
@datakind
. w the incredible Ramya, Edwin, Christine.
We proposed the Geoclidean benchmark at
@NeurIPSConf
'22 to evaluate few-shot generalization of humans & vision models, and show that models cannot match human performance in geometric concept learning. 1 year later, it seems like this is still the case.
Across 74 tasks, GPT-4V notably underperforms humans in understanding the underlying Euclidean geometry principles of geometric shapes.
These images range from commonly seen shapes, such as equilateral triangles, to abstract shapes constructed from different geometric constraints.
How can we develop methods that conduct complex spatio-temporal reasoning over long-form human motion capture data?
Check out our
#ICML2023
paper on Motion Question Answering via Modular Motion Programs. Our poster session is tomorrow, Wed 7/26, at 2pm -
#613
.
🧵👇
However, we find that visual readout from diagrams is still quite a challenging task for vision models.
Check out our full exploration with
@gabrielpoesia
@jiajunwu_cs
@noahdgoodman
!
Paper:
It isn't clear whether GPT-4V's failure is due to perceptual factors (such as not distinguishing whether a line segment ends on a circle), or due to a bias to answer in ways consistent with the language (not faithfully relying on the image context in its response).
@ACMEAAMO
@CityofSanJose
@DataKind
We're launching an open data standard and analysis framework in the City of San Jose -- a publicly accessible system to understand data equity in government datasets.
@ACMEAAMO
@CityofSanJose
@DataKind
Increased academic & government collaboration is essential, and a great way to apply CS + social good to the real world. We're stoked to be launching our work with the
@CityofSanJose
! Grateful for support from
@sliccardo
,
@SJFD
, and the data equity team. Paper & website to come.
@ACMEAAMO
@CityofSanJose
@DataKind
Our system consists of three parts: 1) a US Census-linked interface to augment data with demographic factors, 2) centralized equity analyses to reduce technical barriers for understanding inequities, and 3) an open data standard to improve data usability for policymaking.
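Part (1) in miniature (all field names, tract IDs, and numbers below are made up for illustration): augmenting a service-request record with tract-level demographics is a keyed join on Census tract ID.

```python
# Hypothetical schemas: join city records to Census tract demographics by tract ID.
# All values here are invented example data.
tract_demographics = {
    "06085_5031": {"median_income": 81000, "pct_renters": 0.62},
    "06085_5044": {"median_income": 123000, "pct_renters": 0.31},
}

requests = [
    {"id": 1, "type": "pothole", "tract": "06085_5031"},
    {"id": 2, "type": "streetlight", "tract": "06085_5044"},
]

# Merge each record with the demographics of its tract.
augmented = [{**r, **tract_demographics[r["tract"]]} for r in requests]
print(augmented[0]["median_income"])  # → 81000
```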
When humans reason about complex questions, we often leverage diagrammatic abstractions drawn on a visual scratchpad.
We enable LLMs to do the same: execute draw commands & readout abstractions from the resulting diagram with a vision model that is finetuned w/ expert iteration.
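A cartoon version of that loop (the draw-command format and readout here are invented stand-ins, not our actual interface or models): the LLM emits commands, a renderer rasterizes them, and a vision model reads abstractions back out as structured text.

```python
# Minimal "visual scratchpad" sketch with stand-in components.
import numpy as np

def render(commands, size=32):
    """Rasterize axis-aligned line commands onto a blank canvas."""
    canvas = np.zeros((size, size), dtype=np.uint8)
    for cmd in commands:
        if cmd[0] == "line":  # ("line", r0, c0, r1, c1)
            _, r0, c0, r1, c1 = cmd
            canvas[min(r0, r1):max(r0, r1) + 1, min(c0, c1):max(c0, c1) + 1] = 1
    return canvas

def readout(canvas):
    """Stand-in for a finetuned vision model: report coarse abstractions."""
    ys, xs = np.nonzero(canvas)
    return {"n_pixels": int(canvas.sum()),
            "bbox": (int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max()))}

commands = [("line", 5, 5, 5, 20), ("line", 5, 5, 20, 5)]  # an "L" shape
abstraction = readout(render(commands))
print(abstraction["bbox"])  # → (5, 5, 20, 20)
```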
We see that humans significantly outperform ImageNet pre-trained vision models' low and high-level features across 74 tasks. We believe that this gap illustrates potential for improvement in learning visual features that align with human sensitivities to geometry.
@CVPR
@maojiayuan
@jiajunwu_cs
tl;dr: We show that we can integrate large language models with modular neural networks, and execute symbolic programs with learned concepts for complex 3D visual reasoning.
Paper:
Code:
The unified LEFT framework can perform visual reasoning in 2D, 3D, temporal motion, and robotic manipulation domains. It can also zero-shot transfer its concept knowledge to unseen tasks, through flexible LLM-generated logic & effective reuse of learned, modular visual concepts.
We propose Logic-Enhanced FMs as a general framework for concept learning & reasoning across domains and tasks. LEFT does not require predefined programs for new datasets and is easy to build on.
We release demos to show how to apply LEFT on a new dataset in ~100 lines of code!
LEFT leverages LLMs to take language queries and output programs in a general first-order logic reasoning language, shared across domains and tasks. LEFT's executor then executes the programs with learnable domain-specific grounding modules, initialized with LLM-parsed concepts.
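In toy form (the module names, program, and soft-logic choices below are mine, not LEFT's actual code): grounding modules score objects and pairs, and logical operators become differentiable reductions over those scores, so the whole pipeline trains end-to-end.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical learnable grounding modules: in a real system these are small
# neural nets; here they are plain linear / bilinear maps over object features.
def ground_unary(w, obj_feats):
    return obj_feats @ w                        # per-object score, e.g. "red"

def ground_binary(M, obj_feats):
    return obj_feats @ M @ obj_feats.T          # per-pair score, e.g. "left"

def execute_exists_left_of_red(obj_feats, red_w, left_M):
    """Soft execution of: exists x. left(x, iota y. red(y))."""
    red = softmax(ground_unary(red_w, obj_feats))  # soft referent: "the red object"
    left = ground_binary(left_M, obj_feats)        # left(i, j) scores
    per_object = left @ red                        # expected "left of red" per object
    return per_object.max()                        # soft existential quantifier

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                    # 4 objects, 8-dim features
val = execute_exists_left_of_red(feats, rng.normal(size=8), rng.normal(size=(8, 8)))
```

Because every step is a differentiable array operation, gradients from a downstream answer loss reach the grounding parameters directly.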
We can do the same general decomposition and execution in a variety of domains and for a variety of tasks (see more examples on our project page). Concepts in language serve as abstractions that enable such generalization.
@GomezpoloDiego
@fredahshi
We run human experiments with Prolific in our paper, and show that human participants do in fact find consensus for most concepts, with responses from 30 participants for each concept. See Table 2 in our paper!
@NeurIPSConf
@jeffhygu
@syeung10
We show that models that capture implicit hierarchical relationships in biomedical images are better suited for unsupervised 3D segmentation, and propose a 3D hyperbolic VAE with a gyroplane convolutional layer as well as reconstruction of implicit hierarchies as a pretext task.
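For intuition on the hyperbolic side, here is the standard exponential map at the origin of the Poincaré ball (a textbook ingredient, not the gyroplane layer itself): points pushed toward the boundary get exponentially far apart, which suits tree-like hierarchies.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = np.linalg.norm(v)
    if norm < eps:
        return v
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def dist0(x, c=1.0):
    """Geodesic distance from the origin to x in the Poincare ball."""
    sqrt_c = np.sqrt(c)
    return (2.0 / sqrt_c) * np.arctanh(sqrt_c * np.linalg.norm(x))

v = np.array([3.0, 4.0])        # Euclidean encoder output, norm 5
x = expmap0(v)                  # lands strictly inside the unit ball
print(np.linalg.norm(x) < 1.0)  # → True
```

Note the characteristic property: `dist0(expmap0(v))` equals `2 * ||v||`, so modest Euclidean norms already reach far into the hierarchy.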
We show that we can leverage knowledge from large benchmark image datasets for instance segmentation on a wide range of unlabelled biomedical datasets. DARCNN bridges large domain shifts w simple inductive biases, sequential feature-level adaptation, & image-level pseudolabeling.
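The pseudolabeling half can be sketched generically (a simplification; DARCNN's actual adaptation is detection-specific and feature-level): keep only target-domain predictions the current model is already confident about, then retrain on them.

```python
# Generic confidence-thresholded pseudo-labeling, with a stand-in model.
def pseudo_label(predict, unlabeled, threshold=0.9):
    """Return (example, label) pairs whose predicted confidence clears threshold."""
    labeled = []
    for x in unlabeled:
        label, conf = predict(x)
        if conf >= threshold:
            labeled.append((x, label))
    return labeled

# Toy predictor: confident on even numbers only.
fake_predict = lambda x: ("even", 0.95) if x % 2 == 0 else ("odd", 0.5)
print(pseudo_label(fake_predict, [1, 2, 3, 4]))  # → [(2, 'even'), (4, 'even')]
```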
@fredahshi
Great question! As there are no negative examples, we could indeed generate infinitely many rules consistent with the positive images seen in the few-shot examples. Despite this, the generalization rule chosen by humans corresponds well to the Euclidean construction universe.
Notable prior works have proposed LLMs for reasoning with execution from pretrained VLMs, but they are inference-only and not trainable. Our model (LEFT) can learn new concept grounding from data in domains w/o predefined models, as its executor is fully differentiable.
@MuzafferKal_
@ayirpelle
We have not yet done a systematic study to categorize which rules GPT-4V uses. In some qualitative samples, we see that it tends to find a broader rule than the true one, for example, "triangle" as the discriminative rule, instead of the more specific & accurate "equilateral triangle".
@CVPR
@maojiayuan
@jiajunwu_cs
(2) an object-centric feature encoder, based on PointNet++, extracts object and relational representations of different arities between 3D object point clouds. We learn high-arity features by re-using binary features, reducing time and memory cost for inference.
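The reuse trick, in toy form (my own simplification, not the actual encoder): a ternary feature for (i, j, k) is composed from the stored binary features for (i, j) and (j, k), so only an O(n²) table of learned pairwise features is ever stored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obj, d = 5, 16
# Binary features for all ordered object pairs, the only learned table we keep.
pairwise = rng.normal(size=(n_obj, n_obj, d))

def ternary_from_binary(pairwise, W):
    """Compose t[i, j, k] from the (i, j) and (j, k) binary features."""
    left = pairwise[:, :, None, :]    # (i, j) broadcast over k
    right = pairwise[None, :, :, :]   # (j, k) broadcast over i
    stacked = np.concatenate(np.broadcast_arrays(left, right), axis=-1)
    return stacked @ W                # small learned projection to d dims

W = rng.normal(size=(2 * d, d))
t = ternary_from_binary(pairwise, W)
print(t.shape)  # → (5, 5, 5, 16)
```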
@CVPR
@maojiayuan
@jiajunwu_cs
Our structured neuro-symbolic approach enables generalization to novel object co-occurrences and scenes, and zero-shot transfer to an unseen 3D question-answering task. NS3D can recompose learned models to build new QA operators, requiring no additional training (!)
@CVPR
@maojiayuan
@jiajunwu_cs
(3) a neural program executor, with functional modules implemented as lightweight neural networks, takes the symbolic program and representations to compute the output. Our encoder and executor can effectively reason about complex semantic forms with general arity-based programs.
The Geoclidean framework is easy to use, and our few-shot generalization tasks are publicly available. We're excited to see different ways of exploring Euclidean geometry with Geoclidean!
Paper:
Poster:
Our Geoclidean framework realizes constructions in the Euclidean geometry universe; it includes a domain-specific language to define Euclidean constructions, and a renderer to instantiate abstract symbolic concepts into concrete images.
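For flavor, here is a toy realization of the classic equilateral-triangle construction (Euclid I.1) in plain code; the syntax is mine, not Geoclidean's actual DSL:

```python
# Compass-and-straightedge in miniature: intersect two circles of radius |p1 p2|.
import math, random

def circle_intersection(c1, r1, c2, r2):
    """One intersection point of two circles (assumes they intersect)."""
    (x1, y1), (x2, y2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    a = (r1**2 - r2**2 + d**2) / (2 * d)
    h = math.sqrt(max(r1**2 - a**2, 0.0))
    mx, my = x1 + a * (x2 - x1) / d, y1 + a * (y2 - y1) / d
    return (mx - h * (y2 - y1) / d, my + h * (x2 - x1) / d)

def equilateral(p1, p2):
    """Euclid I.1: apex of the equilateral triangle on segment p1 p2."""
    r = math.dist(p1, p2)
    return circle_intersection(p1, r, p2, r)

random.seed(0)
p1 = (random.random(), random.random())
p2 = (random.random(), random.random())
apex = equilateral(p1, p2)
sides = [math.dist(a, b) for a, b in [(p1, p2), (p2, apex), (apex, p1)]]
print(max(sides) - min(sides) < 1e-9)  # → True
```

Sampling the free points and re-running the construction is exactly how abstract concepts get instantiated into many concrete images.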
@CVPR
@maojiayuan
@jiajunwu_cs
(1) a semantic parser parses input language into symbolic programs, which resemble the underlying hierarchical reasoning structure of the language. We show that we can effectively leverage large language models to do this decomposition faithfully.
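For example (operator names here are illustrative, not our actual grammar), "the chair that is left of the red table" becomes a nested program whose depth mirrors the hierarchical structure of the language:

```python
# A referring expression as a nested symbolic program (illustrative operators).
program = (
    "filter", "chair",
    ("relate", "left",
     ("filter", "red",
      ("filter", "table", ("scene",)))),
)

def depth(node):
    """Nesting depth of the program tree."""
    return 1 + max((depth(c) for c in node if isinstance(c, tuple)), default=0)

print(depth(program))  # → 5
```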
@NickEMoran
However, it's difficult to disentangle whether GPT-4V fails due to perception errors (cannot tell where line segments end), language bias (in training, lines tend to intersect with another object), or concept learning (the abstraction used to discriminate 'wug' is incorrect). 2/2
@CVPR
@maojiayuan
@jiajunwu_cs
We evaluate NS3D on the ReferIt3D task, a 3D referring expression comprehension benchmark, and show state-of-the-art results on tasks with complex view-dependent relations. Importantly, we show significantly improved data-efficiency results, crucial in the 3D domain.