Zihang Dai Profile
Zihang Dai

@ZihangDai

15,938 Followers
206 Following
1 Media
18 Statuses

Working hard @xai

Bay Area
Joined March 2012
@ZihangDai
Zihang Dai
8 months
Haven’t been to a conference in a while. Looking forward to meeting friends on Thursday!
@jimmybajimmyba
Jimmy Ba
8 months
Excited to arrive at NeurIPS later today alongside some of my colleagues. @xai / @grok crew will have a Meet & Greet session on Thursday at 2:30pm local time by the registration desk. Drop by for some fun, giggles, and good roasts!
294
202
693
243
157
210
@ZihangDai
Zihang Dai
4 years
In NLP, the O(TD^2) linear projections in Transformer often cost more FLOPs than the O(T^2D) attention, since commonly D > T. While many efforts focus on reducing the quadratic attention to linear O(TKD), our Funnel-Transformer explores reducing T instead, with clear gains:
[Tweet includes three images]
7
31
163
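The FLOP comparison in this tweet can be made concrete with a small back-of-the-envelope script. The sketch below is my own illustration, not from the thread: it counts only the four projection matmuls and the two attention matmuls of a single self-attention layer, ignoring constants such as softmax, biases, and the number of heads.

```python
# A minimal sketch (not from the thread) comparing per-layer FLOPs of the
# linear projections vs. the attention matmuls in a standard Transformer.
# Only matrix multiplications are counted; constant factors are simplified.

def projection_flops(T, D):
    # Q, K, V and output projections: 4 matmuls of shape (T, D) x (D, D)
    return 4 * T * D * D          # O(T * D^2)

def attention_flops(T, D):
    # Q @ K^T and attn @ V: matmuls of shape (T, D) x (D, T) and (T, T) x (T, D)
    return 2 * T * T * D          # O(T^2 * D)

for T, D in [(128, 768), (512, 768), (512, 1024)]:
    p, a = projection_flops(T, D), attention_flops(T, D)
    print(f"T={T:4d} D={D:4d}  projections={p:,}  attention={a:,}  ratio={p / a:.1f}")
```

With these simplified counts the ratio is 2D/T, so for BERT-like settings (D = 768, T = 128 or 512) the projections dominate by a factor of 3 to 12, which is why reducing T, as Funnel-Transformer does, can pay off more than only attacking the T^2 term.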
@ZihangDai
Zihang Dai
4 years
2
3
88
@ZihangDai
Zihang Dai
4 years
@srush_nlp Thanks for the donation.
2
3
80
@ZihangDai
Zihang Dai
4 years
In practice, one should also consider tuning the hidden dimension "D", given its significant effect on the FLOPs. Some facts:
- All datasets in GLUE only require T = 128; SQuAD, RACE, and many RC datasets require T = 512.
- D = 512/768/1024 for mobile-BERT/BERT-base/BERT-large.
15
2
20
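To see the effect of D numerically, here is a rough sketch (assumptions mine, not from the thread: a standard FFN with a 4*D inner dimension, constants simplified) that evaluates per-layer FLOPs for the sequence lengths and hidden sizes listed in the tweet.

```python
# A rough sketch of how per-layer Transformer FLOPs scale with hidden size D
# for the sequence lengths mentioned above. FFN assumed to use the standard
# 4*D inner dimension; softmax, biases, and head counts are ignored.

def per_layer_flops(T, D):
    proj = 4 * T * D * D      # Q/K/V/output projections: O(T * D^2)
    attn = 2 * T * T * D      # QK^T and attn @ V:        O(T^2 * D)
    ffn  = 8 * T * D * D      # two FFN matmuls (D -> 4D -> D): O(T * D^2)
    return proj + attn + ffn

for name, T in [("GLUE-style (T=128)", 128), ("SQuAD/RACE-style (T=512)", 512)]:
    for D in (512, 768, 1024):
        print(f"{name:26s} D={D:4d}  FLOPs/layer = {per_layer_flops(T, D):,}")
```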
@ZihangDai
Zihang Dai
4 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah
(1) From the bias-variance trade-off perspective, Transformer has a "weaker" model bias.
(2) Key success factors: (a) a deep-thin TFM rather than a shallow-fat TFM; (b) *copy* AWD-LSTM regularization.
(3) After (2), the variance is not high.
2
1
15
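As a concrete reading of "deep-thin rather than shallow-fat", the sketch below (my own illustration; the 16x512 and 4x1024 shapes are hypothetical, not from the thread) compares two Transformers with the same parameter budget but very different shapes, using the rough estimate of 12*D^2 parameters per layer.

```python
# A minimal sketch of what "deep-thin vs. shallow-fat" means at a fixed
# parameter budget. Per-layer Transformer params ~ 12 * D^2
# (4*D^2 for attention + 8*D^2 for the FFN); embeddings and biases ignored.

def transformer_params(n_layers, D):
    return n_layers * 12 * D * D

deep_thin   = transformer_params(n_layers=16, D=512)    # many layers, small D
shallow_fat = transformer_params(n_layers=4,  D=1024)   # few layers, large D

print(f"deep-thin   (16 x  512): {deep_thin:,} params")
print(f"shallow-fat ( 4 x 1024): {shallow_fat:,} params")  # same budget, different shape
```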
@ZihangDai
Zihang Dai
5 years
@ZhitingHu Nice work and interesting results. It may be better and easier to use Transformer-XL for generation ().
1
1
13
@ZihangDai
Zihang Dai
4 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah
(1) It is mentioned in the paper: "Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight average to Transformer-XL".
(2) Starting from AWD-LSTM, it took me less than a week to get under 60 PPL.
(3) I got the final PPL within 2 weeks with 4 GPUs on my own machine.
1
2
12
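For readers unfamiliar with the AWD-LSTM-style regularization mentioned in (1), here is a minimal sketch of variational (locked) dropout, assuming PyTorch; the class name and the (seq_len, batch, hidden) layout are my own choices, not taken from the Transformer-XL code. The other ingredient, weight averaging, simply averages parameters over late training iterations.

```python
import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    """Locked/variational dropout: one mask per sequence, shared across all
    time steps, as in AWD-LSTM-style regularization.
    Input is assumed to be shaped (seq_len, batch, hidden)."""

    def __init__(self, p=0.2):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # Sample a single mask with a size-1 time dimension, then broadcast
        # the same mask over every time step of the sequence.
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)

# usage: drop = VariationalDropout(0.2); hidden = drop(hidden)
```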
@ZihangDai
Zihang Dai
4 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah For (2), I only meant the key success factors for **small datasets like PTB**.
1
0
2