Zihang Dai Profile
Zihang Dai

@ZihangDai

15,938 Followers
206 Following
1 Media
18 Statuses

Working hard @xai

Bay Area
Joined March 2012
@ZihangDai
Zihang Dai
8 months
Haven’t been to a conference in a while. Looking forward to meeting friends on Thursday!
@jimmybajimmyba
Jimmy Ba
8 months
Excited to arrive at NeurIPS later today alongside some of my colleagues. @xai / @grok crew will have a Meet & Greet session on Thursday at 2:30pm local time by the registration desk. Drop by for some fun, giggles, and good roasts!
294
202
693
243
157
210
@ZihangDai
Zihang Dai
4 years
In NLP, the O(TD^2) linear projections in Transformer often cost more FLOPs than the O(T^2D) attention, since commonly D > T. While many efforts focus on reducing the quadratic attention to linear O(TKD), our Funnel-Transformer explores reducing T instead, with clear gains:
[Tweet includes three images]
7
31
163
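The FLOP comparison in this tweet can be made concrete with a small back-of-the-envelope script. The sketch below is my own illustration, not from the thread: it counts only the four projection matmuls and the two attention matmuls of a single self-attention layer, ignoring constants such as softmax, biases, and the number of heads.

```python
# A minimal sketch (not from the thread) comparing per-layer FLOPs of the
# linear projections vs. the attention matmuls in a standard Transformer.
# Only matrix multiplications are counted; constant factors are simplified.

def projection_flops(T, D):
    # Q, K, V and output projections: 4 matmuls of shape (T, D) x (D, D)
    return 4 * T * D * D          # O(T * D^2)

def attention_flops(T, D):
    # Q @ K^T and attn @ V: matmuls of shape (T, D) x (D, T) and (T, T) x (T, D)
    return 2 * T * T * D          # O(T^2 * D)

for T, D in [(128, 768), (512, 768), (512, 1024)]:
    p, a = projection_flops(T, D), attention_flops(T, D)
    print(f"T={T:4d} D={D:4d}  projections={p:,}  attention={a:,}  ratio={p / a:.1f}")
```

With these simplified counts the ratio is 2D/T, so for BERT-like settings (D = 768, T = 128 or 512) the projections dominate by a factor of 3 to 12, which is why reducing T, as Funnel-Transformer does, can pay off more than only attacking the T^2 term.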
@ZihangDai
Zihang Dai
4 years
2
3
88
@ZihangDai
Zihang Dai
4 years
@srush_nlp Thanks for the donation.
2
3
80
@ZihangDai
Zihang Dai
4 years
In practice, one should also consider tuning the hidden dimension "D", given its significant effect on the FLOPs. Some facts:
- All datasets in GLUE only require T = 128; SQuAD, RACE, and many RC datasets require T = 512.
- D = 512/768/1024 for mobile-BERT/BERT-base/BERT-large.
15
2
20
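To see the effect of D numerically, here is a rough sketch (assumptions mine, not from the thread: a standard FFN with a 4*D inner dimension, constants simplified) that evaluates per-layer FLOPs for the sequence lengths and hidden sizes listed in the tweet.

```python
# A rough sketch of how per-layer Transformer FLOPs scale with hidden size D
# for the sequence lengths mentioned above. FFN assumed to use the standard
# 4*D inner dimension; softmax, biases, and head counts are ignored.

def per_layer_flops(T, D):
    proj = 4 * T * D * D      # Q/K/V/output projections: O(T * D^2)
    attn = 2 * T * T * D      # QK^T and attn @ V:        O(T^2 * D)
    ffn  = 8 * T * D * D      # two FFN matmuls (D -> 4D -> D): O(T * D^2)
    return proj + attn + ffn

for name, T in [("GLUE-style (T=128)", 128), ("SQuAD/RACE-style (T=512)", 512)]:
    for D in (512, 768, 1024):
        print(f"{name:26s} D={D:4d}  FLOPs/layer = {per_layer_flops(T, D):,}")
```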
@ZihangDai
Zihang Dai
4 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah
(1) From the bias-variance trade-off perspective, Transformer has a "weaker" model bias.
(2) Key success factors: (a) a deep-thin TFM rather than a shallow-fat TFM; (b) *copy* AWD-LSTM regularization.
(3) After (2), the variance is not high.
2
1
15
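As a concrete reading of "deep-thin rather than shallow-fat", the sketch below (my own illustration; the 16x512 and 4x1024 shapes are hypothetical, not from the thread) compares two Transformers with the same parameter budget but very different shapes, using the rough estimate of 12*D^2 parameters per layer.

```python
# A minimal sketch of what "deep-thin vs. shallow-fat" means at a fixed
# parameter budget. Per-layer Transformer params ~ 12 * D^2
# (4*D^2 for attention + 8*D^2 for the FFN); embeddings and biases ignored.

def transformer_params(n_layers, D):
    return n_layers * 12 * D * D

deep_thin   = transformer_params(n_layers=16, D=512)    # many layers, small D
shallow_fat = transformer_params(n_layers=4,  D=1024)   # few layers, large D

print(f"deep-thin   (16 x  512): {deep_thin:,} params")
print(f"shallow-fat ( 4 x 1024): {shallow_fat:,} params")  # same budget, different shape
```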
@ZihangDai
Zihang Dai
5 years
@ZhitingHu Nice work and interesting results. It may be better and easier to use Transformer-XL for generation ().
1
1
13
@ZihangDai
Zihang Dai
4 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah
(1) It is mentioned in the paper: "Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight average to Transformer-XL".
(2) Starting from AWD-LSTM, it took me less than a week to get under 60 PPL.
(3) I got the final PPL within 2 weeks with 4 GPUs on my own machine.
1
2
12
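For readers unfamiliar with the AWD-LSTM-style regularization mentioned in (1), here is a minimal sketch of variational (locked) dropout, assuming PyTorch; the class name and the (seq_len, batch, hidden) layout are my own choices, not taken from the Transformer-XL code. The other ingredient, weight averaging, simply averages parameters over late training iterations.

```python
import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    """Locked/variational dropout: one mask per sequence, shared across all
    time steps, as in AWD-LSTM-style regularization.
    Input is assumed to be shaped (seq_len, batch, hidden)."""

    def __init__(self, p=0.2):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # Sample a single mask with a size-1 time dimension, then broadcast
        # the same mask over every time step of the sequence.
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)

# usage: drop = VariationalDropout(0.2); hidden = drop(hidden)
```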
@ZihangDai
Zihang Dai
4 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah For (2), I only meant the key success factors for **small datasets like PTB**.
1
0
2