[IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (#143986)

Scale opcodes, types, and args once in `IR2VecVocabAnalysis` to avoid rescaling them every time embeddings are computed. This PR refactors the vocabulary to explicitly define three sections (Opcodes, Types, and Arguments) used for computing embeddings. (Tracking issue: #141817; partly fixes #141832)
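
To illustrate the idea, here is a minimal standalone sketch of scaling a vocabulary once up front. It is not the code from this PR: `Vocabulary`, `Weights`, `scaleVocabularyOnce`, and `embedInstruction` are hypothetical names, and the rule that an instruction embedding is the sum of its opcode, type, and argument vectors is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real IR2Vec types; names are assumptions.
using Embedding = std::vector<double>;
using Section = std::map<std::string, Embedding>;

// The three vocabulary sections named in the PR description.
struct Vocabulary {
  Section Opcodes;
  Section Types;
  Section Arguments;
};

// Per-section weights (illustrative; not taken from the PR).
struct Weights {
  double Opcode;
  double Type;
  double Argument;
};

// Multiply every vector in one section by its weight, in place.
static void scaleSection(Section &S, double Weight) {
  for (auto &Entry : S)
    for (double &Value : Entry.second)
      Value *= Weight;
}

// Scale all three sections a single time, e.g. right after loading the
// vocabulary, so later lookups need no further multiplication.
void scaleVocabularyOnce(Vocabulary &V, const Weights &W) {
  scaleSection(V.Opcodes, W.Opcode);
  scaleSection(V.Types, W.Type);
  scaleSection(V.Arguments, W.Argument);
}

// Element-wise Dst += Src.
static void accumulate(Embedding &Dst, const Embedding &Src) {
  assert(Dst.size() == Src.size() && "embedding dimensions must match");
  for (std::size_t I = 0; I < Dst.size(); ++I)
    Dst[I] += Src[I];
}

// With a pre-scaled vocabulary, an instruction embedding is a plain sum of
// lookups (assumed combination rule for this sketch).
Embedding embedInstruction(const Vocabulary &V, const std::string &Opcode,
                           const std::string &Type,
                           const std::vector<std::string> &Args) {
  Embedding Result = V.Opcodes.at(Opcode);
  accumulate(Result, V.Types.at(Type));
  for (const std::string &Arg : Args)
    accumulate(Result, V.Arguments.at(Arg));
  return Result;
}
```

In this sketch the multiplication cost is paid once per vocabulary entry rather than once per lookup, which is the saving the description refers to.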
2025-06-30 23:09:19 +02:00
parent 56ef00a59d
commit 0745eb501d
17 changed files with 384 additions and 157 deletions
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -448,7 +448,16 @@ downstream tasks, including ML-guided compiler optimizations.

 The core components are:
  - **Vocabulary**: A mapping from IR entities (opcodes, types, etc.) to their
-    vector representations. This is managed by ``IR2VecVocabAnalysis``.
+    vector representations. This is managed by ``IR2VecVocabAnalysis``. The
+    vocabulary (a ``.json`` file) contains three sections: Opcodes, Types, and
+    Arguments. Each section holds the representations of the corresponding
+    entities.
+
+    .. note::
+
+       All three sections must be present for the vocabulary file to be
+       valid; the order in which they appear does not matter.
+
  - **Embedder**: A class (``ir2vec::Embedder``) that uses the vocabulary to
    compute embeddings for instructions, basic blocks, and functions.