[IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (#143986)

Scale opcodes, types, and arguments once in `IR2VecVocabAnalysis` so that they are not rescaled every time embeddings are computed. This PR also refactors the vocabulary to explicitly define three sections (Opcodes, Types, and Arguments) that are used for computing embeddings.

(Tracking issue - #141817 ; partly fixes - #141832)
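
The idea of scaling once can be sketched as below. This is a simplified illustration, not the code in this PR: `Vocabulary`, `scaleVocabulary`, and `embedInstruction` are hypothetical names, the weights are arbitrary, and the embedding is assumed to be a plain sum of the opcode, type, and argument vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the LLVM types.
using Embedding = std::vector<double>;
using Section = std::map<std::string, Embedding>;

// The three vocabulary sections described in this PR.
struct Vocabulary {
  Section Opcodes;
  Section Types;
  Section Arguments;
};

// Multiply every vector in one section by that section's weight.
void scaleSection(Section &S, double Weight) {
  for (auto &Entry : S)
    for (double &Value : Entry.second)
      Value *= Weight;
}

// Scale the whole vocabulary once, up front (assumed to happen when the
// vocabulary is loaded).
void scaleVocabulary(Vocabulary &V, double OpcWeight, double TypeWeight,
                     double ArgWeight) {
  scaleSection(V.Opcodes, OpcWeight);
  scaleSection(V.Types, TypeWeight);
  scaleSection(V.Arguments, ArgWeight);
}

// Add Src into Dst element by element.
void accumulate(Embedding &Dst, const Embedding &Src) {
  assert(Dst.size() == Src.size() && "embedding dimensions must match");
  for (std::size_t I = 0; I < Dst.size(); ++I)
    Dst[I] += Src[I];
}

// With a pre-scaled vocabulary, embedding an instruction is just a sum of
// lookups; no weight is applied per instruction.
Embedding embedInstruction(const Vocabulary &V, const std::string &Opcode,
                           const std::string &Type,
                           const std::vector<std::string> &Args) {
  Embedding Result = V.Opcodes.at(Opcode);
  accumulate(Result, V.Types.at(Type));
  for (const std::string &Arg : Args)
    accumulate(Result, V.Arguments.at(Arg));
  return Result;
}
```

In this sketch the multiplication cost is paid once per vocabulary entry rather than once per lookup, which is the saving the commit message describes.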
S. VenkataKeerthy
2025-06-30 23:09:19 +02:00
committed by GitHub
parent 56ef00a59d
commit 0745eb501d
17 changed files with 384 additions and 157 deletions

@@ -448,7 +448,16 @@ downstream tasks, including ML-guided compiler optimizations.
The core components are:
- **Vocabulary**: A mapping from IR entities (opcodes, types, etc.) to their
vector representations. This is managed by ``IR2VecVocabAnalysis``. The
vocabulary (.json file) contains three sections -- Opcodes, Types, and
Arguments, each containing the representations of the corresponding
entities.
.. note::
All three sections must be present in the vocabulary file for it to be
valid; the order in which they appear does not matter.
- **Embedder**: A class (``ir2vec::Embedder``) that uses the vocabulary to
compute embeddings for instructions, basic blocks, and functions.