Research Scientist at Meta Reality Labs in Boston
Prev PhD at RWTH Aachen + Carnegie Mellon + Uni Oxford
Dynamic 3D Gaussians + SplaTAM + HOTA + more
From NZ
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
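For the code-inclined, a minimal sketch of that parameterization (my own hypothetical field names, not the released code): appearance stays fixed per Gaussian, while its center and rotation get a value per timestep, so dense 3D trajectories can be read straight out of the means.

```python
# Minimal sketch (hypothetical names) of the dynamic scene parameterization:
# fixed appearance per Gaussian, time-varying position and orientation.
import torch

T, N = 150, 300_000  # timesteps and Gaussian count, sizes quoted in the thread
params = {
    # persistent over time
    "colors":      torch.rand(N, 3, requires_grad=True),
    "log_opacity": torch.zeros(N, 1, requires_grad=True),
    "log_scales":  torch.zeros(N, 3, requires_grad=True),
    # time-varying: reading out "means" over t gives dense 3D trajectories
    "means": torch.zeros(T, N, 3, requires_grad=True),
    "quats": torch.rand(T, N, 4, requires_grad=True),  # per-step rotations
}
```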
For all you NeRF people:
Instant-NGP (the crazy-fast "train a NeRF in 5 seconds" method) has released PyTorch bindings, which make it MUCH easier to use compared to raw CUDA code.
Excited to present our
@CVPR
Oral paper HODOR
Typically, Video Object Segmentation methods learn low-level pixel correspondence.
Instead, we use transformers to extract high-level object embeddings that can be used to re-segment objects throughout a video.
This is my first follow-up work after Dynamic 3D Gaussians.
Nikhil (
@Nik__V__
) and I have been working closely to build a system that can tackle SLAM using Gaussian Splatting: estimating camera poses without COLMAP and working in real time on live streaming data.
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
We extend Gaussian Splatting to solve SLAM, i.e., automatically estimating the camera poses while fitting the Gaussian scene to RGB-D videos.
Try it on your own iPhone capture today! 🧵👇
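The tracking step, in spirit, is pose optimization through the differentiable renderer. A minimal sketch, assuming a hypothetical render() over a frozen Gaussian map (the actual SplaTAM losses use silhouette masking and more; see the paper):

```python
# Minimal sketch: freeze the Gaussian map, optimize only the new frame's
# camera pose against the observed RGB-D frame. render() is hypothetical.
import torch

def track_frame(gaussians, rgb_gt, depth_gt, pose_init, iters=100):
    pose = pose_init.clone().requires_grad_(True)  # e.g. a 6-DoF pose vector
    opt = torch.optim.Adam([pose], lr=1e-3)
    for _ in range(iters):
        rgb, depth = render(gaussians, pose)       # differentiable rasterizer
        loss = (rgb - rgb_gt).abs().mean() + (depth - depth_gt).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return pose.detach()
```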
This enables a number of exciting applications such as the composition of different dynamic scene elements, first-person view synthesis and temporally consistent 4D scene editing.
It's also FAST! These render at 850 FPS, and only take around 2 hours to train on a single GPU.
I also made an interactive dynamic 3D viewer.
I honestly think this is going to be the future of all of entertainment.
Movies + Games converging to the same thing.
The future will for sure be 'Dynamic' and '3D', and my bet is on it being made of Gaussians.
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
Interesting fact: I was working on Dynamic 3D Gaussians for a while before the 3D Gaussian Splatting paper came out. Originally I used the "fuzzy metaball" Gaussians from
@leo_nid_k
but swapped to the splat version for the fast CUDA code. Def check out Leo's paper
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
Multi-Object Tracking (MOT) has been notoriously difficult to evaluate, and evaluation has been a constant source of frustration for many.
Check out this blog post ( ) which describes our recent work on the HOTA metrics for better tracking evaluation! 1/6
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
Can we track objects for which we don't have training data?
Check out our
@CVPR
22 Oral paper
Opening Up Open World Tracking
We present a benchmark, baseline & analysis for kickstarting open-world tracking!
Oral / Poster Friday
#CVPR
#CVPR22
#CVPR2022
Check out the website and paper! There are A LOT more cool videos and results to explore!
Website:
Paper:
Also thanks to my collaborators George Kopanas (
@GKopanas
), Bastian Leibe (
@RWTHVisionLab
) and Deva Ramanan (
@RamananDeva
).
RobMOTS: The Ultimate Tracking and Video Segmentation Challenge at CVPR'21.
Deadline June 11th.
8 different benchmarks come together to create the ultimate combined challenge.
Waymo, KITTI, BDD100K, TAO, MOTChallenge, YouTube-VOS, OVIS and DAVIS.
3D tracking results are on average only 1.5cm off from the ground truth, in fast-moving, complex scenes 150 frames long. All while densely tracking around 300k Gaussians simultaneously.
See the comparison between our tracks for certain points (blue) and the ground-truth (red).
Forecasting object locations directly from raw LiDAR is hard.
In our
@CVPR
paper, FutureDet, we repurpose 3D detection architectures for forecasting by directly predicting "Future Object Detections"
#CVPR
#CVPR2022
#CVPR22
Excited to present 3 papers (w 2 Orals) this week
@CVPR
1 - HODOR: Video Segmentation trained without Video
2 - Open World Tracking: Tracking object classes beyond those in training
3 - FutureDet: Reformulating forecasting as Future Detection
🧵👇
#CVPR
#CVPR2022
#CVPR22
The core idea is enforcing that Gaussians have persistent color, opacity, and size over time, and regularizing the Gaussians' motion and rotation with local-rigidity constraints. Dense 6-DOF tracking emerges from persistent dynamic view synthesis, without any correspondence or flow input.
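In code terms, the rigidity term looks roughly like this minimal sketch (my own notation, not the released code): each Gaussian's k nearest neighbors should keep the same offsets once expressed in that Gaussian's rotating local frame.

```python
# Minimal sketch of a local-rigidity regularizer: neighbor offsets, rotated
# into each Gaussian's local frame, should stay constant between timesteps.
import torch

def local_rigidity_loss(means_t, means_prev, rot_t, rot_prev, nn_idx):
    """means_*: (N, 3) centers, rot_*: (N, 3, 3) rotations, nn_idx: (N, k)."""
    def local_offsets(means, rot):
        off = means[nn_idx] - means[:, None]           # (N, k, 3) world frame
        return torch.einsum("nij,nkj->nki", rot.transpose(1, 2), off)
    return (local_offsets(means_t, rot_t)
            - local_offsets(means_prev, rot_prev)).norm(dim=-1).mean()
```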
4 days until the Robust Video Scene Understanding Workshop at CVPR!
8 exciting invited speakers including
@FidlerSanja
@kkitani
@judyfhoffman
@WeidiXie
, Katerina Fragkiadaki, Philipp Krähenbühl, Michael Felsberg and Lorenzo Torresani.
+ MUCH MORE..
🧵👇
Interested in Object Tracking in 3D? Dynamic Object Reconstruction? Generic Object Discovery? Open Set Scene Understanding?
I am currently presenting two papers on these topics at (virtual) ICRA 2020.
@ICRA2020
#ICRA2020
#ICRA
Join the discussion!
1/5
Details of this can be found here:
I don't have any more time to dedicate to this, but if someone can get this to work and let me know that would be awesome!
Opening up Open-World Tracking.
Itโs impossible to label EVERY class an agent might see.
But not detecting and tracking UNKNOWN objects may lead to DISASTER.
In Open-World Tracking, trackers trained on only 80 classes need to track ANY unknown object.
3D was either pretty, or fast. Now it's BOTH! Meet Interactive Scenes built with Gaussian Splatting:
🔥Browser & Phone-Friendly: Hyperefficient and fast rendering everywhere
🌐Embed Anywhere: 8-20MB streaming files (even smaller soon!)
✨Ultra High Quality offline NeRF renders &
Huge Update!!!
HOTA Metrics now evaluated live on the MOTChallenge benchmark too!
And TrackEval code now the official evaluation code for MOTChallenge.
Check it out:
Another huge step toward the future of tracking research.
I'm giving a talk about Visual Object Tracking in about 2.5 hours.
I will cover advances from old-school Lucas-Kanade Template Tracking to our state-of-the-art Siam R-CNN ().
Sign up to attend here:
@rerundotio
I have a question for you guys!
Do you think I could replace open3D in my current vis pipeline with rerun?
Can rerun render a 300k-point point cloud, updated every timestep, at the 800 FPS at which I can create them?
Details and implementation here:
See
My suggestion to fix reviewing: instead of individuals reviewing, have teams of authors from other submissions review. The joint authors of each submission need to review 3 papers together as a team. Potentially, if they neglect their reviewing duties, their submission will be desk-rejected.
@s1ddok
Holy moly this is cool!!!!
We should def chat! I want to see the dynamic stuff running on a VR headset so bad!
Also, putting the view-dependent effects back in shouldn't make it any slower. It was 850 FPS because I was only doing 640x360 images.
The workshop I am organizing on Multi-Object Tracking and Segmentation has just begun!
Tune in live: (or join in on zoom - link on CVPR workshop landing page)
Exciting talks from Bernt Schiele, Raquel Urtasun, Xin Wang and Alyosha Efros
Calling all object tracking researchers!!!
Submit your trackers to our ECCV workshop challenge on Tracking Any Object (TAO) (tracking on 833 categories!)
Deadline is August 16!!! Excited to see you all present your results at our Workshop!
@VGolyanik
SceneNeRFlow is very cool work. Lots of things are really similar. Literally 9 min before I uploaded, my gf sent me it saying "scoopy doopy doo". I would love to compare it to Dynamic 3D Gaussians, on both the dataset I use and the dataset they use!!
@tretschk
@MZollhoefer
@chlassner
@Shedletsky
@CoffeeVectors
The 'input' is NOT the video you see. It's actually a bunch of static cameras. Here we are reconstructing the dynamic world in a persistent way across time with a bunch of small Gaussians. This enables us to render novel views (e.g. the loop you see) and also to track all...
This enables some awesome behavior, such as being able to train WITHOUT VIDEO (from single images), or from video where only one frame is labeled.
Oral talk and poster Tuesday afternoon (NOW) in the Video Analysis session.
#CVPR
#CVPR22
#CVPR2022
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
Workshop begins in ~1.5 hours at 7am EDT / 1pm CEST.
Don't miss all the invited speakers + papers + challenge results!
Featured in
#CVPR
Daily magazine!
YouTube livestream open to everyone:
Schedule here:
4 days until the Robust Video Scene Understanding Workshop at CVPR!
8 exciting invited speakers including
@FidlerSanja
@kkitani
@judyfhoffman
@WeidiXie
, Katerina Fragkiadaki, Philipp Krähenbühl, Michael Felsberg and Lorenzo Torresani.
+ MUCH MORE..
🧵👇
@CoffeeVectors
It only uses so much VRAM because it stores the training images on the GPU by default. But there is a toggle to turn this off and store them on CPU until needed. With this the amount of VRAM is VERY small.
@CorahMicah
I'm EXTREMELY interested. There is a lot of follow-up work to do, but it'll take the whole community.
I'm excited to release the code so people can play with it. Also happy to collaborate and help people out with their own projects building on this.
@eigenhector
I think it's really not that hard. You can see my PR on the Gaussian Splatting paper which adapts the CUDA code to also render depth. I have another internal version which renders median depth instead of mean depth, which I find gives better geometry (no bleeding between edges).
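For anyone curious what that buys you, a minimal single-ray sketch in plain PyTorch (not the CUDA rasterizer): mean depth blends all depths by compositing weight, which bleeds across edges, while median depth takes the Gaussian where accumulated opacity first crosses 0.5.

```python
# Minimal single-ray sketch: mean vs. median depth under alpha compositing.
import torch

def composite_depths(alphas, depths):
    """alphas, depths: (N,) per-Gaussian opacity/depth, sorted front-to-back."""
    T = torch.cumprod(torch.cat([torch.ones(1), 1 - alphas[:-1]]), dim=0)
    w = alphas * T                                  # compositing weights
    mean_depth = (w * depths).sum() / w.sum().clamp(min=1e-8)
    acc = torch.cumsum(w, dim=0)                    # accumulated opacity
    idx = torch.searchsorted(acc, torch.tensor(0.5)).clamp(max=len(depths) - 1)
    return mean_depth, depths[idx]                  # (mean, median) depth
```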
Currently the bindings are for the underlying hash representation and super-fast MLPs, and there is a PyTorch example for fitting to a 2D image.
I spent today trying to incorporate this into NeRF, with a little, but mostly not very much, success.
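For reference, a minimal sketch of that 2D image-fitting setup through the tinycudann bindings; the config numbers are illustrative, and random tensors stand in for real pixel/color pairs:

```python
# Minimal sketch: fit RGB as a function of (x, y) with a hash-grid encoding
# feeding a fully fused MLP, via the tiny-cuda-nn PyTorch bindings.
import torch
import tinycudann as tcnn

model = tcnn.NetworkWithInputEncoding(
    n_input_dims=2, n_output_dims=3,
    encoding_config={"otype": "HashGrid", "n_levels": 16,
                     "n_features_per_level": 2, "log2_hashmap_size": 19,
                     "base_resolution": 16, "per_level_scale": 2.0},
    network_config={"otype": "FullyFusedMLP", "activation": "ReLU",
                    "output_activation": "None", "n_neurons": 64,
                    "n_hidden_layers": 2},
)

coords = torch.rand(8192, 2, device="cuda")   # stand-in pixel coordinates
target = torch.rand(8192, 3, device="cuda")   # stand-in target colors
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(1000):
    loss = ((model(coords).float() - target) ** 2).mean()  # output is fp16
    opt.zero_grad(); loss.backward(); opt.step()
```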
Trying to improve evaluation of Multi-Object Tracking and need your help in trying to judge what makes a good tracker.
Anyone can help, and any of your time would be appreciated.
@AjdDavison
@alzugarayign
Thanks Andrew! I remember meeting you in Sicily in I think 2018. I was and still am in awe of all your amazing work. Means a lot to me that you also like my work!
We have released a tracking evaluation codebase: TrackEval.
It contains HOTA + many other metrics, and runs on multiple benchmark formats.
It is 100% Python, easy to understand and extend, and SUPER FAST (10x faster than previous evaluation code).
5/6
Also: RVSU CVPR'21 Workshop Call for Papers.
Call for submission-track papers on Tracking, Video Segmentation and other aspects of Video Understanding.
Deadline June 4th.
Papers are restricted to 4 pages to allow joint submission to main-track conferences.
8.5 days to submit your trackers to the ultimate tracking challenge.
RobMOTS evaluates multi-object tracking across 8 different benchmarks.
Val and test servers now live.
Previous metrics overemphasize either detection (MOTA) or association (IDF1), while mostly ignoring the other. HOTA is designed to balance both of these evenly.
More details in our IJCV paper (open access):
3/6
Ever wondered what the world looks like to your pet dog? Our latest
#ICCV2023
paper, Total-Recon, enables embodied view synthesis of deformable scenes from a casual RGBD video:
Drop by poster
#10
on Friday 10:30~12:30pm in Rm. Foyer Sud to learn more!
1/2
In our new paper, "Customizing Motion in Text-to-Video Diffusion Models" we show a method for introducing novel motions into text-to-video diffusion models.
Given a few examples of a novel motion and a generic description, our method creates a new text mapping in the network.
Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
We model the world as a set of 3D Gaussians that move & rotate over time. This extends Gaussian Splatting to dynamic scenes, with accurate novel-view synthesis and dense 3D trajectories.
@savvyRL
I think this is more human bias. I think human ACs are doing most of the heavy lifting here vs a matching algo. Ofc the ACs' suggestions are influenced by the matching algo's recommendations also.
HOTA is calculated by combining three IoU scores: one for each of detection, association and localization!
HOTA combines these into one score, while also allowing analysis of each, and further dividing each of these into a recall and precision component. 2/6
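In sketch form (a simplification; computing DetA and AssA per threshold from the matching is where the real work in the paper lives), the combination step is a geometric mean averaged over localization thresholds:

```python
# Minimal sketch of the HOTA combination step: geometric mean of detection
# and association accuracy at each localization threshold, then averaged.
import numpy as np

def hota(det_a, ass_a):
    """det_a, ass_a: arrays of DetA/AssA at alphas 0.05, 0.10, ..., 0.95."""
    return float(np.sqrt(np.asarray(det_a) * np.asarray(ass_a)).mean())

# A tracker that detects well but associates poorly is penalized evenly:
print(hota(det_a=np.full(19, 0.9), ass_a=np.full(19, 0.4)))  # ~0.6
```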
4 days until the Robust Video Scene Understanding Workshop at CVPR!
8 exciting invited speakers including
@FidlerSanja
@kkitani
@judyfhoffman
@WeidiXie
, Katerina Fragkiadaki, Philipp Krähenbühl, Michael Felsberg and Lorenzo Torresani.
+ MUCH MORE..
🧵👇
@Shedletsky
@CoffeeVectors
yes there are 27 input cameras in a semi-circular dome facing inward, and we accurately know the positions of them. This definitely makes the problem easier compared to having fewer cameras and not knowing where they are. There is lots of interesting future work to do for sure!
Is there an app that exports RGB-D video from an iPad Pro w/ LiDAR?
Don't want meshes or pt clouds. Just RGB video + greyscale depth video.
Lots of scanning apps. None give RGB-D. Could write exporter from but better if don't have to. Maybe
@nobbis
knows?
@eigenhector
E.g. I understand why one would want to mesh an MLP-based NeRF - because MLPs suck to deal with. But Gaussians already intrinsically have all the properties one would want from a geometry representation, in my opinion.
What's it like to be a
#PhD
student in
#Germany
? You can get paid, and well. You can afford a car, an apartment, and provide for a family. You may work with great advisors at great institutions. And the food... well, the food. Read on!
@keenanisalive
@akanazawa
@Jimantha
This requires a single image as input to generate an infinite perceptual scene. But that single image could be generated from a standard image GAN, so then it could be fully automatic.
HOTA has now launched live as the official metrics for KITTI tracking and KITTI MOTS.
This will open many new opportunities for developing trackers.
Tracking:
MOTS:
4/6
@miguel_algaba
@JSelikoff
Yeah! At the moment I have only run it on scenes with multiple cameras (these real scenes have 27 train cameras, and some synthetic scenes have 20). I don't know how well it would work with fewer. Also note that calibrating multiple cameras in the wild is really hard.
@giffmana
Hahaha these Gaussians aren't too different from triangles tbh. The magic is that they are nice and differentiable, so we can easily fit them from real data with diff rendering.
@smallfly
@AceOfThumbs
@Scobleizer
I am open sourcing the splatting viewer I built in the next few days. Should be very easy to input a list of extrinsic matrices and output a video. Or alternatively control the camera path interactively.
Surprising to hear that INGP training is faster for you. This can be fixed.
Thanks to
@kangle_deng
for contributing to this code today and fixing some of my bugs!
Thanks to
@yen_chen_lin
for the great NeRF pytorch implementation which makes it easy to build upon.
Finally, thanks to Thomas Müller for the super-fast CUDA code and PyTorch bindings.
Kris Kitani (
@kkitani
) is live now.
Giving a keynote on "Perception + Prediction for autonomous driving."
(open for everyone).
Or through
#CVPR
website for the zoom link.
4 days until the Robust Video Scene Understanding Workshop at CVPR!
8 exciting invited speakers including
@FidlerSanja
@kkitani
@judyfhoffman
@WeidiXie
, Katerina Fragkiadaki, Philipp Krähenbühl, Michael Felsberg and Lorenzo Torresani.
+ MUCH MORE..
🧵👇
@hegemonetics
hahaha yeah 850 FPS is A LOT. But note that these are 640x360 images. It's a bit slower at full HD but it's still REALLY fast!
1280x720: 400 fps
1920x1080: 250 fps
@giffmana
I found it!
Seems to work super nice for RGB-D. Can export as mp4 and even do streaming over usb or wifi!
Almost perfect. Gives intrinsics but doesn't seem to give extrinsics / camera pose...
Keeping an eye out for something that does both.
@janusch_patas
@jonstephens85
@RobMakesMeta
@Scobleizer
If I prioritise code release over other things I wanted to do, I might be able to release in ~1 week. Otherwise I'll be camping in the Andes and it'll have to wait for Oct 1 ish. But would like to get people's hands on it.
@smallfly
@AceOfThumbs
@Scobleizer
Is it possible to make a side-by-side with the same camera path (the NeRF camera path was better)? This would be quite illuminating. Also make sure to mention the training time and video rendering time for each. They should be significantly different.
@janusch_patas
@nobbis
@JulienBlanchon
@antimatter15
I think the point is the MIT licence applies to the code that they wrote (e.g. what is in the repo). The code that they use from other repos (e.g. diff Gaussian rasterisation) is not MIT. Thus you can't actually run the code commercially, but you may use the new part of the code.
At
@CVPR
I will be presenting the oral presentation at around 1:50pm in the datasets track (in 40 minutes), and after that come talk to us all at the poster at 3pm (poster 35b).
We also show that current end-to-end forecasting evaluation metrics are severely gameable, and present a better suite of evaluation metrics.
Come check out our poster
@cvpr
on Friday morning!
Forecasting object locations directly from raw LiDAR is hard.
In our
@CVPR
paper, FutureDet, we repurpose 3D detection architectures for forecasting by directly predicting "Future Object Detections"
#CVPR
#CVPR2022
#CVPR22
Excited to present our
@CVPR
Oral paper HODOR
Typically, Video Object Segmentation methods learn low-level pixel correspondence.
Instead, we use transformers to extract high-level object embeddings that can be used to re-segment objects throughout a video.
@HelgeRhodin
This looks like cool work!!! I wonder if there are things we could combine from your work to make the current Dynamic 3D Gaussians even better?
The core idea is enforcing that Gaussians have persistent color, opacity, and size over time, and regularizing the Gaussians' motion and rotation with local-rigidity constraints. Dense 6-DOF tracking emerges from persistent dynamic view synthesis, without any correspondence or flow input.