Bilzard @bilzrd Twitter profile

Bilzard

@bilzrd

8 months

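A piece from the Matsuo Lab on the "secrets of getting papers accepted at international journals." It puts even the fine details into concrete words, which I found helpful.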

1

138

1K

Bilzard

@bilzrd

2 years

This was blazingly fast. PyTorch's DataLoader took 3 minutes to load 200M records, whereas merlin-dataloader took only 500 ms. Roughly a 400x speedup.

GitHub - NVIDIA-Merlin/NVTabular: NVTabular is a feature engineering and preprocessing library for...

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems...

github.com

2

97

643

Bilzard

@bilzrd

4 months

This is an easy one to get stuck on.

1

51

536

Bilzard

@bilzrd

11 months

The order I recommend for reading LLM papers is GPT-1 -> GPT-2 -> GPT-3 -> InstructGPT -> GPT-3.5 -> GPT-4. Later LLMs such as Llama largely follow the GPT papers, so reading the GPT papers first makes them easier to understand.

1

28

381

Bilzard

@bilzrd

2 years

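A clear explanation of the evolution from diffusion models to DALL-E 2. There were more steps than I expected, so this was a big help compared with tracking down every paper on my own.

[AI Paper Walkthrough] DALL-E 2: Generating high-quality, diverse images that follow text (Overview)

Next video (details): https://youtu.be/kF3v3_hsWUQ An overview of DALL-E 2, which can generate high-quality, diverse images that follow text. Link to the DALL-E 2 project page: https://openai.com/dall-e-2/ -- Open source provided by Sony...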

www.youtube.com

1

48

359

Bilzard

@bilzrd

2 years

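Material covering GNNs broadly, from how they work to their application areas. The explanation that graph representations are a general representation subsuming everyday tasks such as images and text was eye-opening. It is written in plainer language than an academic paper and has plenty of figures, so it is a good way to build intuition.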

A Gentle Introduction to Graph Neural Networks

What components are needed for building learning algorithms that leverage the structure and properties of graphs?

distill.pub

0

37

348

Bilzard

@bilzrd

1 month

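Reinterpreted within the kernel-method framework, the Transformer's softmax attention can be approximated by Monte Carlo methods using a certain projection of Q/K and random numbers. This work claims that computing the Q/K projection with a DCT-based projection and learnable constants dramatically improves training speed while keeping performance intact.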

DiJiang: Efficient Large Language Models through Compact Kernelization

In an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, the improvement strategies for attention mechanisms typically...

arxiv.org

1

22

228

Bilzard

@bilzrd

1 year

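I was sorting out what the Adam algorithm actually means. Almost anyone doing Kaggle has used this optimizer, but I suspect some people use it without knowing what its parameters mean or the details of the algorithm (I was one of them).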
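As a reminder of what the parameters do, here is a minimal sketch of one Adam update for a single scalar parameter. It follows the standard published update rule; the function name `adam_step` and the toy objective f(x) = x^2 are illustrative assumptions, not something taken from the post.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init of m
    v_hat = v / (1 - beta2 ** t)                # bias correction for the zero init of v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage (assumed objective): minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

Note that on the first step the bias-corrected ratio is roughly grad / |grad|, so the parameter moves by about `lr` regardless of the gradient's magnitude.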

1

27

228

Bilzard

@bilzrd

9 days

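A pioneering study proposing that Transformer attention can be formulated with kernel methods. It also discusses masking, which later work does not address explicitly (and which is therefore sometimes overlooked).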

Transformer Dissection: A Unified Understanding of...

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction....

arxiv.org

1

21

207

Bilzard

@bilzrd

11 months

I wrote up the LLM papers I have read recently. I left out fine-tuning because I have read only a small fraction of the many methods proposed there.

[survey] Recently proposed methods for LLMs

zenn.dev

0

24

206

Bilzard

@bilzrd

2 years

A blog post in which a retired Grandmaster sums up their own impressions of the good and bad sides of Kaggle. Realistic and helpful.

Are Kaggle Competitions Worth It? Ponderings of a Kaggle Grandmaster

I would not have a data science career without Kaggle. So if you are looking for a blog post bashing Kaggle, this is not the place. Competing on Kaggle is worth it! It will teach you a lot about...

forecastegy.com

0

20

177

Bilzard

@bilzrd

2 years

With Polars, as long as I take care to write expressions instead of using apply, execution is parallelized automatically, which improved my productivity considerably.

1

18

175

Bilzard

@bilzrd

11 months

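In language-model pretraining it has become customary to "increase the batch size and decrease the learning rate as the model gets larger." Among OpenAI's models, GPT-3 was the first to mention this, and it apparently reflects findings from the company's earlier work on the gradient noise scale.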

2

25

160

Bilzard

@bilzrd

1 year

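I'd love for recommender-system people to try the LLM competition too. How to give a model knowledge it did not learn during pretraining remains an important theme even in the LLM era, and it is still largely unexplored, with plenty of room to grow. keywords: RAG (Retrieval-augmented generation)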

1

7

155

Bilzard

@bilzrd

2 years

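A blog post explaining in detail the cause of, and fixes for, a problem that occurs when a PyTorch data loader references a large number of Python objects (list, dict, etc.): the objects' reference counts get copied into each process's own memory space, needlessly inflating total memory usage.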

Demystify RAM Usage in Multi-Process Data Loaders

A typical PyTorch training program on 8 GPUs with 4 dataloader workers per GPU would create at least processes. A naive use of PyTorch dataset and dataloader can easily replicate your dataset's

ppwwyyxx.com

2

25

144

Bilzard

@bilzrd

1 year

A paper that (apparently) proves that confidence intervals for a metric computed naively from cross-validation are much narrower than the true intervals (i.e., overconfident).

Cross-validation: what does it estimate and how well does it do it?

Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the...

arxiv.org

1

18

139

Bilzard

@bilzrd

1 year

I wrote an article on a loss=NaN failure I ran into and how I dealt with it.

On eps and the stability of half-precision and mixed-precision training

zenn.dev

0

20

133

Bilzard

@bilzrd

1 month

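KAN (Kolmogorov-Arnold Network): Viewing a NN as a function approximator, an MLP uses an affine transform plus a nonlinear activation as its basic unit. The idea is that replacing this basic unit with a univariate function that has learnable parameters should cause no problems for most practically relevant functions.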

1

19

131

Bilzard

@bilzrd

2 years

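Box Embedding: By embedding concepts as boxes rather than points, it aims to represent the extent of a concept and the containment relations between concepts. It can reportedly be trained without supervision on billion-scale data and beat Word2Vec on downstream tasks, so it might be built into BERT or GPT in the near future.

Words as boxes! Understanding the new embedding method Box Embedding from the basics

Embedding techniques that represent words as vectors have become widely used in recent years, but they can only represent a word as a "point" and cannot express the extent of a concept or hierarchical relations. Embedding methods that represent data as regions such as "boxes" are therefore being studied. This article carefully explains, from the basics, "Box Embedding," which represents data as boxes and learns embeddings more powerful than vectors.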

ja.stateofaiguides.com

0

23

130

Bilzard

@bilzrd

1 month

Incidentally, the idea of "computing KV first instead of QK to achieve complexity linear in the token length" is apparently a fairly common technique. The figure is from cosFormer.
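The reordering can be checked numerically. The sketch below assumes a simplified, non-causal attention with no softmax, in which non-negative random matrices stand in for the feature-mapped queries and keys; it is not cosFormer's actual kernel, only the associativity trick the tweet refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # sequence length, head dimension
Q = rng.random((n, d))           # non-negative stand-ins for feature-mapped queries
K = rng.random((n, d))           # non-negative stand-ins for feature-mapped keys
V = rng.random((n, d))

# Quadratic order: (Q K^T) V materializes an n x n score matrix.
scores = Q @ K.T                                          # shape (n, n)
out_quadratic = (scores @ V) / scores.sum(axis=1, keepdims=True)

# Linear order: K^T V is only d x d, so the cost grows linearly with n.
kv = K.T @ V                                              # shape (d, d)
k_sum = K.sum(axis=0)                                     # shape (d,)
out_linear = (Q @ kv) / (Q @ k_sum)[:, None]
```

Both orders give the same output because matrix multiplication is associative once the softmax is replaced by a decomposable similarity.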

0

14

126

Bilzard

@bilzrd

2 years

I wrote up the LB probing techniques I have used in Kaggle competitions so far.

LB probing techniques I have used in Kaggle competitions so far

zenn.dev

0

6

127

Bilzard

@bilzrd

2 years

decord - a framework built for efficiently loading video files for machine learning. Random access is fast, and total load time is reportedly several times faster than OpenCV.

GitHub - dmlc/decord: An efficient video loader for deep learning with smart shuffling that's super...

An efficient video loader for deep learning with smart shuffling that's super easy to digest - dmlc/decord

github.com

0

26

123

Bilzard

@bilzrd

1 year

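Posted to Hatena Blog #はてなブログ On the relationship between batch size and learning rate - 機械学習の詰め合わせ (Machine Learning Assortment)

On the relationship between batch size and learning rate - 機械学習の詰め合わせ (Machine Learning Assortment)

There is a convention of "increasing the learning rate in proportion to the batch size" (Rule 1); this post explains the rationale for it. Rationale: First, the idea seems to originate from paper [1]. That paper shows empirically that applying the rule "increase the learning rate in proportion to the batch size" yields good post-convergence performance across a wide range of batch sizes. In the paper, this rule…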

bilzard.hatenablog.com
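Rule 1 amounts to a one-line formula. The sketch below is only an illustration: the helper name `scaled_lr` and the reference values (learning rate 0.1 at batch size 256) are assumptions, not numbers taken from the blog post.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    return base_lr * batch / base_batch

# Hypothetical reference point: lr = 0.1 was tuned at batch size 256.
lr_1024 = scaled_lr(0.1, 256, 1024)   # 4x the batch size gives 4x the learning rate
```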

0

10

120

Bilzard

@bilzrd

11 months

I wrote an ultra-minimal survey of alternatives to self-attention, centered on papers I have covered before.

Various methods for replacing self-attention

zenn.dev

0

9

115

Bilzard

@bilzrd

1 year

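LoRA: When fine-tuning an LLM on a downstream task, you normally need gradient updates for as many parameters as the original model has. The proposed method freezes the pretrained weights and confines the weight delta to an extremely low-rank space (rank 1-4), dramatically reducing memory consumption and the number of trainable parameters.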
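A minimal sketch of the idea, assuming a single 64x64 linear layer and rank 4 (both sizes are illustrative). The frozen weight `W0` is left untouched and only the two small factors `A` and `B` would be trained; initializing `B` to zero makes the delta start at zero, so the adapted layer initially behaves exactly like the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                    # illustrative layer size and rank

W0 = rng.standard_normal((d_out, d_in))       # pretrained weight, frozen
A = rng.standard_normal((r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                      # trainable, zero-initialized

def lora_forward(x):
    """Compute x W0^T + x (B A)^T without ever forming the full d_out x d_in delta."""
    return x @ W0.T + (x @ A.T) @ B.T

full_params = W0.size               # parameters updated by full fine-tuning
lora_params = A.size + B.size       # parameters updated by the low-rank adapter
```

Here the adapter has 512 trainable parameters versus 4096 for the full matrix, and the gap widens as the layer grows because the adapter scales with r * (d_in + d_out) rather than d_in * d_out.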

1

19

111

Bilzard

@bilzrd

7 days

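Sony's explainers are as clear as ever. This one explains the architecture in more detail.

[AI Paper Walkthrough] A GPT for robots! Robotics Transformer (RT-1)

[AI Paper Walkthrough] is a video series introducing papers on deep learning and machine learning. (Playlist: https://www.youtube.com/playlist?list=PLbtqZvaoOVPCqfmnrBfo9Xv5mtDr0LjQZ ) This time: Robotics Trans..., which could be called a GPT for robots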

www.youtube.com

1

11

114

Bilzard

@bilzrd

1 month

A more efficient version of KAN than the authors' implementation has been released.

GitHub - Blealtan/efficient-kan: An efficient pure-PyTorch implementation of Kolmogorov-Arnold...

An efficient pure-PyTorch implementation of Kolmogorov-Arnold Network (KAN). - Blealtan/efficient-kan

github.com

1

9

107

Bilzard

@bilzrd

1 year

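A method aimed at making LLM outputs more "grounded in facts." Conceptually, it takes the predictions of the deep layers, which predict based on knowledge such as semantics, and subtracts the influence of the predictions of the shallow layers, which predict based on formal knowledge such as grammar.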

DoLa: Decoding by Contrasting Layers Improves Factuality in Large...

Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple...

arxiv.org
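The "subtract the shallow layer" idea can be sketched with toy numbers. Everything below is a hedged illustration: the logits are made up, the function name `contrast_layers` is hypothetical, and the plausibility mask with threshold `alpha` is included as an assumption about how such contrasts are usually kept from promoting very unlikely tokens.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def contrast_layers(deep_logits, shallow_logits, alpha=0.1):
    """Score tokens by log p_deep - log p_shallow, restricted to plausible tokens."""
    deep = log_softmax(np.asarray(deep_logits, dtype=float))
    shallow = log_softmax(np.asarray(shallow_logits, dtype=float))
    # Assumed safeguard: keep tokens whose deep-layer probability is at least
    # alpha times the probability of the most likely token.
    plausible = deep >= np.log(alpha) + deep.max()
    return np.where(plausible, deep - shallow, -np.inf)

# Made-up logits over a 4-token vocabulary.
deep_logits = [2.0, 2.1, 0.0, -3.0]      # token 1 is narrowly the deep layer's favorite
shallow_logits = [0.0, 2.0, 0.0, 0.0]    # the shallow layer already favors token 1
scores = contrast_layers(deep_logits, shallow_logits)
best = int(np.argmax(scores))
```

In this toy case the contrast selects token 0, the token whose probability rose between the shallow and deep layers, rather than token 1, which the shallow layer already predicted.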

1

15

107

Bilzard

@bilzrd

1 month

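Softmax was introduced for interpretability and stability, so that the attention weights sum to 1, which makes me think attention can be replaced without it. In that case, it seems a simpler computation, without DCT, could also convert it into the form on the right.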

Bilzard

@bilzrd

1 month

Here is that figure (excerpted from the paper).

1

0

5

1

8

102

Bilzard

@bilzrd

2 years

Data leakage with pseudo-labels: drawn as a diagram, this is what it looks like.

1

8

102

Bilzard

@bilzrd

2 years

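PANNs: Pretrained audio neural networks for audio pattern recognition. This paper trains CNN weights on a large-scale dataset downloaded from YouTube, with the intent of enabling transfer learning to a variety of downstream tasks. Prior work had used only small datasets.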

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio...

Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech...

arxiv.org

1

8

99

Bilzard

@bilzrd

2 years

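In applications such as finance and medicine, you sometimes want a classifier's confidence to match the actual probability. Shallow models satisfy this property well, whereas deep learning models do not necessarily do so. There are therefore methods that post-process the confidence so that it matches the actual probability.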

Should Ensemble Members Be Calibrated?

Underlying the use of statistical approaches for a wide range of applications is the assumption that the probabilities obtained from a statistical model are representative of the "true"...

arxiv.org

1

13

99

Bilzard

@bilzrd

11 months

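In text generation, decoding methods that maximize probability are known to produce unnatural text with heavy repetition. This paper attributes this to the embeddings of the tokens in the generated text being very similar to one another,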

A Contrastive Framework for Neural Text Generation

Text generation is of great importance to many natural language processing applications. However, maximization-based decoding methods (e.g. beam search) of neural language models often lead to...

arxiv.org

2

8

96

Bilzard

@bilzrd

1 year

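I read the Sophia paper (except the estimation part). Adapting to per-parameter differences in curvature, which SGD and Adam handle poorly, requires computing second derivatives (the Hessian), but that is generally expensive, so the idea is to substitute a statistical estimate. Clearly written and easy to read.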

1

13

93

Bilzard

@bilzrd

3 years

So many Japanese competitors are at the top that people think positioning is a fad in Japan, lol.

Google Smartphone Decimeter Challenge

Improve high precision GNSS positioning and navigation accuracy on smartphones

www.kaggle.com

0

17

94

Bilzard

@bilzrd

1 year

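A paper on where to place the norm in a Transformer. The originally proposed architecture (post-norm) needs warmup steps and is sensitive to hyperparameters, whereas the proposed architecture (pre-norm) needs no warmup, is less sensitive to hyperparameters, and also converges faster.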
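The two placements differ only in where the normalization sits relative to the residual connection. The sketch below is a simplified illustration: `layer_norm` omits the learnable scale and bias, and a single random linear map stands in for the attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learnable scale or bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer: normalize after adding the residual.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sublayer input; the residual path stays an identity.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1
sublayer = lambda h: h @ W               # stand-in for attention or a feed-forward layer
x = rng.standard_normal((3, 8))
y_post = post_norm_block(x, sublayer)
y_pre = pre_norm_block(x, sublayer)
```

In the pre-norm form the input reaches the output through an unnormalized identity path, which is the structural difference usually cited when explaining its more stable gradients.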

0

12

93

Bilzard

@bilzrd

2 years

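The performance of pseudo-labeling has already been examined in detail here.

Classifying CIFAR-10 with semi-supervised learning using pseudo-labels (Pseudo-Label) - Qiita

TL;DR: Tried pseudo-labeling, a semi-supervised learning technique, on CIFAR-10. With few samples, pseudo-labels raised test accuracy. However, compared with transfer learning, slightly…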

qiita.com

1

8

92

Bilzard

@bilzrd

2 years

Everyone should use nb_black more. It automatically formats the style of the code in your cells in Jupyter environments.

GitHub - dnanhkhoa/nb_black: A simple extension for Jupyter Notebook and Jupyter Lab to beautify...

A simple extension for Jupyter Notebook and Jupyter Lab to beautify Python code automatically using black. - dnanhkhoa/nb_black

github.com

1

4

89

Bilzard

@bilzrd

11 months

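Existing LLM scaling laws were derived on the assumption that each training example is used only once, but if training data keeps growing at this rate, there is concern that internet data will run out. This paper investigates a modified scaling law that assumes limited data is reused repeatedly.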

Scaling Data-Constrained Language Models

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by...

arxiv.org

2

12

88

Bilzard

@bilzrd

11 months

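In recommendation tasks, relying on a single retriever often seems to be a bad move (it was in Otto, too). Looking at the LLM solutions, ensembling RAG also seems to be the mainstream approach.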

1

6

89

Bilzard

@bilzrd

2 months

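I often run into the mystery of memory not being freed in IPython notebooks, and the fact that it caches the last output is such a trap. I didn't even know this feature existed, and it might be fine to disable it by default.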

1

11

88

Bilzard

@bilzrd

1 year

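Before LLMs, I pictured fine-tuning as "giving the model knowledge that wasn't in pretraining," but with LLMs the picture has shifted toward "the model already has the knowledge, and fine-tuning selects what it generates into the form humans prefer for the task."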

0

10

86

Bilzard

@bilzrd

1 year

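Early/Late Dropout: Many people intuitively believe dropout is "a method that regularizes by adding randomness," but this paper shows that in the early phase of training it instead suppresses randomness and improves the ability to fit the training data.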

1

11

86

Bilzard

@bilzrd

2 years

A tool that lets you version data the way git does 👀

Versioning Data and Models

Open-source version control system for Data Science and Machine Learning projects. Git-like experience to organize your data, models, and experiments.

dvc.org

1

14

84

Bilzard

@bilzrd

2 years

I was about to write "pip has no dependency-resolution mechanism," but it seems that since 20.3 it does have one. You can't write articles on half-baked knowledge...

0

14

82

Bilzard

@bilzrd

11 months

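StreamingLLM: There has been a breakthrough on the LLM context-length problem. This paper discovers a phenomenon in which attention concentrates on the first 1-4 tokens (attention sinks), and shows that as long as the attention for these tokens is retained, sliding-window decoding suffers no performance degradation.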
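The cache policy this implies can be sketched in a few lines. This is only an illustration of which token positions stay cached; the function name and the defaults `num_sinks=4` and `window=8` are assumptions chosen for readability, not values from the paper.

```python
def streaming_cache_positions(num_tokens, num_sinks=4, window=8):
    """Return the token positions kept in the cache after num_tokens tokens.

    The first num_sinks positions (the attention sinks) are always kept,
    together with the most recent `window` positions.
    """
    if num_tokens <= num_sinks + window:
        return list(range(num_tokens))           # nothing has been evicted yet
    sinks = list(range(num_sinks))
    recent = list(range(num_tokens - window, num_tokens))
    return sinks + recent

kept = streaming_cache_positions(20)
```

With these settings the cache never holds more than 12 entries, however long the stream grows; a plain sliding window would differ only by dropping the sink positions.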

1

13

82

Bilzard

@bilzrd

10 days

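vLLM - a tool for serving LLMs. On an RTX3090, it can reportedly serve an 8B-parameter LLM to 100 concurrent users at 13 tokens/sec. I guess it can comfortably handle the load of an internal app for a small team.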

0

10

83

Bilzard

@bilzrd

1 year

I think this is an easy trap to fall into with half-precision training: eps=1e-8 apparently gets rounded to 0.

Adam+Half Precision = NaNs?

It’s probably a 0 division somewhere. Have you tried using a much larger eps (say 1e-4)? The default 1e-8 is rounded to 0 in half precision.

discuss.pytorch.org
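The rounding is easy to reproduce with NumPy's float16. The sketch below is an illustration only: it mimics the denominator of an Adam-style update with zero moment estimates (variable names are made up) and does not use PyTorch itself.

```python
import numpy as np

# The smallest positive float16 value is 2**-24 (about 6e-8), so 1e-8 underflows to 0.
eps_default = np.float16(1e-8)
eps_larger = np.float16(1e-4)     # the larger value suggested in the forum reply

# Illustrative Adam-style division where both moment estimates are zero.
m_hat = np.float16(0.0)
v_hat = np.float16(0.0)
with np.errstate(divide="ignore", invalid="ignore"):
    update_default = m_hat / (np.sqrt(v_hat) + eps_default)   # 0 / 0
    update_larger = m_hat / (np.sqrt(v_hat) + eps_larger)     # 0 / 1e-4
```

With the default eps the division is 0/0 and yields NaN, while the larger eps keeps the denominator non-zero.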

1

12

82

Bilzard

@bilzrd

4 months

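Before I knew it, a project called torcheval had sprouted from official PyTorch. It's handy when you want to compute metrics on the GPU quickly. It's still in alpha, but it's nice that the ecosystem can be completed with official libraries alone.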

GitHub - pytorch/torcheval: A library that contains a rich collection of performant PyTorch model...

A library that contains a rich collection of performant PyTorch model metrics, a simple interface to create new metrics, a toolkit to facilitate metric computation in distributed training and tools...

github.com

0

5

80

Bilzard

@bilzrd

11 months

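A paper on a technique for making the Llama2-7B model forget knowledge about a specific target text. It uses GPT-4 to build a dictionary that replaces target-specific proper nouns and expressions with more generic ones, creates labels that make target-specific expressions less likely to be output, and fine-tunes on them.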

1

13

77

Bilzard

@bilzrd

11 months

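I was going to share a summary of the LLM solutions, but decided it's better for people to read the actual solutions, so I didn't. My own takeaway looks like this.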

1

3

75

Bilzard

@bilzrd

11 months

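My impression is that in ML, methods with solid theoretical foundations rarely outperform empirical ones. From the standpoint of what to read, they take a long time to get through yet yield little practical benefit, so it feels like a loss. (Of course, I'm not saying they're unnecessary.)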

1

2

75

Bilzard

@bilzrd

11 months

In February this year, Google published a paper on training a ViT with 22B parameters. Not an LLM but an LVM.

Scaling Vision Transformers to 22 Billion Parameters

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers...

arxiv.org

0

9

74

Bilzard

@bilzrd

2 years

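DART: A method for boosted-tree algorithms that randomly drops earlier trees, mitigating the problem called "over-specialization," in which later trees fit only a small number of instances. It is similar to dropout in NNs.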

DART: Dropouts meet Multiple Additive Regression Trees

Multiple Additive Regression Trees (MART), an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and it is widely used in practice....

arxiv.org

2

8

73

Bilzard

@bilzrd

1 month

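A lesson from LEAP. If you're not careful, NNs apparently tend toward cheating inference based on the target distribution. Instead of showing the same samples over and over, repeatedly sampling with replacement, as in episodic learning in reinforcement learning, can apparently yield a model that generalizes better in some cases.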

1

3

70

Bilzard

@bilzrd

2 years

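As for coding skill in Kaggle, I feel that the ability to write highly maintainable code, which ordinary software engineers have, is not much needed; what is required is the cost-effectiveness of implementing what you want concisely and in a short time.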

1

4

69

Bilzard

@bilzrd

2 years

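An article summarizing the cheating by a team that had placed 1st (now disqualified) in a Kaggle competition four years ago. At first I couldn't see how it differed from a leak, but it appears to have been ruled cheating because they used data obtained by illegitimate means (scraping) and took various steps to keep it from being discovered afterward.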

How a Kaggle Grandmaster cheated in $25,000 AI contest with hidden code – and was fired from dream...

Pet adoption ML coder apologizes and says desire to be ranked #1 'compromised my judgement'

www.theregister.com

0

8

69

Bilzard

@bilzrd

1 month

Apparently you can do incremental learning with LGBM.

The Recommendation for Training Big Image Data that Cannot Fit into Memory · Issue #4672 · micros...

Summary Right now if we try to train a folder of image data that cannot fit into memory of a single machine, lightgbm is not an available option. We can see some distributed solutions for lightgbm ...

github.com

0

4

69

Bilzard

@bilzrd

1 year

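The hard part of self-study is that you can't notice on your own when you've picked up wrong knowledge. You believe it's correct, but to those around you, you've become just a know-it-all old guy.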

1

15

65

Bilzard

@bilzrd

3 years

Hard to believe, but 9th place. I don't know whether it was just luck. It was worth pushing through even while wrecking my health.

2

0

65

Bilzard

@bilzrd

2 years

Polars: expressions are Python objects, so you can write recursive expressions like this programmatically.

1

4

65

Bilzard

@bilzrd

1 month

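Barely made the gold zone! It was worth persisting with ensembling until two hours before the deadline. Thanks to my teammates who kept training models right up to the end!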

4

1

62

Bilzard

@bilzrd

2 years

A paper on a human-in-the-loop approach to bounding box annotation. By having humans only verify bounding boxes rather than draw them, it both shortens the time humans spend and improves annotation accuracy.

We don't need no bounding-boxes: Training object class...

Training object class detectors typically requires a large set of images in which objects are annotated by bounding-boxes. However, manually drawing bounding-boxes is very time consuming. We...

arxiv.org

2

8

62

Bilzard

@bilzrd

2 years

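Stochastic Gradient Boosting: The author is Jerome H. Friedman, one of the creators of GBDT. A method that regularizes by sampling the data without replacement at each GBDT iteration. It also gives a concise summary of gradient boosting and GBDT methods, which was helpful.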

1

9

61

Bilzard

@bilzrd

9 days

KV caching: When decoding with an LM, the K and V values of previously processed tokens are reused. This technique therefore caches the K/V results, trading memory usage for faster processing.
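A small numerical sketch of why the cache is safe to use: for single-head causal attention, recomputing K and V for the whole prefix at every step gives the same outputs as appending one new K/V row per step. The projection matrices and inputs are random placeholders, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Softmax attention of one query vector over all cached keys and values."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def decode_without_cache(xs):
    # Recomputes K and V for the entire prefix at every step.
    outs = []
    for t in range(1, len(xs) + 1):
        prefix = xs[:t]
        K, V = prefix @ Wk, prefix @ Wv
        outs.append(attend(prefix[-1] @ Wq, K, V))
    return np.stack(outs)

def decode_with_cache(xs):
    # Computes K and V once per token and keeps them in a growing cache.
    K_cache, V_cache, outs = [], [], []
    for x in xs:
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        outs.append(attend(x @ Wq, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outs)

xs = rng.standard_normal((5, d))       # five token embeddings
```

The cached version does one K and one V projection per token instead of re-projecting the whole prefix, at the cost of storing every past K and V row.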

1

3

58

Bilzard

@bilzrd

10 days

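PagedAttention: LLM inference throughput is capped mainly by VRAM. Using a technique called PagedAttention, analogous to paging in OS virtual memory, it avoids fragmentation of the KV cache. It reportedly achieves 100x the throughput of HF transformers.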
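The paging analogy can be illustrated with a toy block table. This is a hypothetical sketch of the bookkeeping only (the class `PagedKVCache` and its methods are invented for illustration and are not vLLM's API): each sequence maps logical token slots to fixed-size physical blocks, so memory is handed out block by block instead of as one contiguous region per sequence.

```python
class PagedKVCache:
    """Toy block-table allocator; it tracks slots only and stores no tensors."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}         # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:        # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]   # sequence "a" spans two blocks
slots_b = [cache.append_token("b") for _ in range(2)]   # sequence "b" fills one block
```

Because blocks are allocated on demand, at most one partially filled block per sequence is wasted, and blocks freed by a finished sequence can be reused by any other sequence.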

1

10

58

Bilzard

@bilzrd

2 years

PyTorch 2.0 has had an experimental release. It adds model compilation, and you can benefit from the speedup by adding just one line to existing code.

0

7

57

Bilzard

@bilzrd

2 years

Post-competition ritual: a commemorative snapshot of the Trello board.

2

0

57

Bilzard

@bilzrd

1 month

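cosFormer is prior work that pointed out the connection between attention and kernel methods. The paper sees it as a problem that "the nonlinear softmax attention cannot be decomposed," and proposes an alternative operation that retains softmax attention's 1) non-negativity and 2) generalization performance from normalization.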

cosFormer: Rethinking Softmax in Attention

Transformer has shown great successes in natural language processing, computer vision, and audio processing. As one of its core components, the softmax attention helps to capture long-range...

arxiv.org

1

7

57

Bilzard

@bilzrd

8 months

On designing convolution padding sizes and pixel misalignment

zenn.dev

0

5

54

Bilzard

@bilzrd

8 months

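If everyone starts using solutions like "I don't really know what's inside, but it's strong so I used it," as with YOLO, Kaggle will get boring, so at the very least I want to keep my own pipeline decomposed into building blocks that can be recombined per task.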

1

53

Bilzard

@bilzrd

14 days

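A paper claiming improvement on sentence-embedding tasks by adding to the objective a loss based on the relative angle of complex numbers. According to the authors, this addresses the vanishing gradients of cosine similarity near 0 and 1.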

AnglE-optimized Text Embeddings

High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge...

arxiv.org

1

8

52

Bilzard

@bilzrd

2 years

polars now comes installed by default in the Kaggle docker image.

Release v128 - GPU · Kaggle/docker-python

[This image will be available on Kaggle Notebooks in the next few days] gcr.io/kaggle-private-byod/python:v128 == Comparing pip list --format=freeze == astroid==2.14.1 | astroid==2.14.2 ...

github.com

0

1

50

Bilzard

@bilzrd

1 month

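Something I noticed while reproducing a teammate's pipeline: "feeding in lots of data and training for a long time" is the royal-road solution for NNs. With my own pipeline, feeding in all of LR would have taken about 3 days, so I had given up, but simplifying the architecture produced a sufficiently strong model in about 1 day.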

1

4

50

Bilzard

@bilzrd

2 years

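I looked back on the Otto competition. It covers - the CV leak - the co-visitation matrix - suspected cheating via private sharing, among other things.

Day 6: Looking back on the Otto competition - 機械学習の詰め合わせ (Machine Learning Assortment)

A look back at events during the Kaggle competition OTTO – Multi-Objective Recommender System (a.k.a. Otto), which I had been taking part in since November. Overall impressions; the CV leak; data design; common design; training scheme (candidate generator; stage 1); …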

bilzard.hatenablog.com

0

4

50

Bilzard

@bilzrd

1 year

The thing that arrives right after a competition ends.

1

0

49

Bilzard

@bilzrd

1 year

QLoRA: Successfully fine-tuned a 4-bit quantized LLM with LoRA on a single GPU with 48GB of memory in 24h.

QLoRA: Efficient Finetuning of Quantized LLMs

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance....

arxiv.org

2

10

48

Bilzard

@bilzrd

11 months

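Headless Language Model: In language-model pretraining, the final-layer projection and softmax consume a lot of memory. The proposed method removes the head and replaces the objective with a task of reconstructing the embeddings of the input sequence, reducing training time and memory consumption.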

1

49

Bilzard

@bilzrd

1 year

If you don't know where to start, I suggest beginning with Chris's discussion. It explains the overall approach clearly.

Kaggle - LLM Science Exam

Use LLMs to answer difficult science questions

www.kaggle.com

0

1

48

Bilzard

@bilzrd

2 years

Sorry, I had the wrong link. The 400x speedup is this one.

GitHub - NVIDIA-Merlin/dataloader: The merlin dataloader lets you rapidly load tabular data for...

The merlin dataloader lets you rapidly load tabular data for training deep leaning models with TensorFlow, PyTorch or JAX - NVIDIA-Merlin/dataloader

github.com

0

10

48

Bilzard

@bilzrd

3 years

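#JQuants I'm sharing the notebook I used to train the model in my final submission to the news analysis competition. For now I've only added explanations to the notebook, but if I have time I'll write up something more organized. For those who didn't take part in the competition: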

0

6

47

Bilzard

@bilzrd

11 months

A summary of the gradient-based optimization methods I read about recently.

A summary of papers I recently read on gradient-based optimization methods

zenn.dev

0

6

47

Bilzard

@bilzrd

1 year

I explained the likely causes of label misalignment.

0

3

47

Bilzard

@bilzrd

2 months

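@tetsuro731 The test data was ordered by location and time, so the host, who didn't want location/time-series solutions, shuffled a new test set and extended the competition by 2 weeks -> 6 days before the end, a user published "a method to fully identify location and time from the features" -> we are here now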

1

5

47

Bilzard

@bilzrd

1 year

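UNETR (UNEt TRansformers): An architecture combining U-Net and Transformer, developed for segmentation of 3D voxel images. It splits the 3D image into patches, encodes them as a 1D sequence, and decodes embeddings of different resolutions from intermediate layers with a U-Net.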

0

7

46

Bilzard

@bilzrd

1 year

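Please point it out if I'm wrong, but in a multi-stage prediction model, using the 1st-stage OOF predictions to predict the 2nd-stage target looks like a leak to me. The figure shows the dependencies among the variables. For example, as the red line shows, there is a path from the fold2 labels to the 2nd-stage fold2 OOF predictions.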

2

5

44