
How Does the AI Assistant Tokenize Files?

The Vertexgraph AI assistant uses a technique called tokenization to break documents down into small pieces. In the context of a Large Language Model (LLM), tokenization means splitting a piece of text into smaller units called tokens. Tokens are typically words or subword units, such as the pieces produced by byte-pair encoding (BPE) or SentencePiece. In this blog, we will walk through how this process works in detail.

  1. Preprocessing: The first step is to preprocess the input text so that it is in a uniform format. Preprocessing involves tasks like lowercasing, removing punctuation, and handling special characters (a minimal sketch of this step appears after the list).

  2. Tokenization: The preprocessed text is then tokenized. In traditional NLP models, tokens are typically whole words. For example, the sentence "I love cats" would be tokenized into three tokens: "I," "love," and "cats." The Vertexgraph AI assistant, however, uses subword tokenization, which breaks text into smaller units that are not necessarily full words. Subword tokenization is especially useful for handling rare or out-of-vocabulary words. For example, the word "unhappiness" might be broken down into "un" and "happiness," or into even smaller units like "un," "happi," and "ness," if those pieces are common in the training data (a toy example follows the list).

  3. Vocabulary: The Vertexgraph AI assistant has a fixed vocabulary of subword tokens. During tokenization, the text is segmented into these subword tokens. The vocabulary is usually quite extensive so that it can cover a wide range of text.

  4. Encoding: After tokenization, each subword token is mapped to an integer ID from the vocabulary, and those IDs are converted into numerical embeddings so the model can process them. The embeddings represent the tokens as high-dimensional vectors (see the final sketch after the list).

  5. Feeding into the model: The tokenized and encoded input is then passed into the LLM. The model processes the input tokens in context to generate responses or make predictions.
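To make step 1 concrete: the exact preprocessing pipeline inside the Vertexgraph assistant is not published, so the snippet below is only a minimal Python sketch of the kind of normalization described above (Unicode normalization, lowercasing, punctuation removal, whitespace cleanup).

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Normalize raw text before tokenization (illustrative sketch only)."""
    # Normalize Unicode so visually identical characters share one code point.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase the text.
    text = text.lower()
    # Drop punctuation, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("  I LOVE cats!!  "))  # -> "i love cats"
```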
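For steps 2 and 3, Vertexgraph's actual tokenizer and vocabulary are internal, so the example below uses a small made-up vocabulary and a greedy longest-match rule (similar in spirit to WordPiece/BPE) purely to show how a word like "unhappiness" can be segmented into known subword pieces.

```python
# A toy subword vocabulary; real vocabularies contain tens of thousands of entries.
VOCAB = {"un", "happi", "ness", "happy", "i", "love", "cat", "s", "[UNK]"}

def subword_tokenize(word: str, vocab=VOCAB) -> list[str]:
    """Greedy longest-match segmentation over a fixed subword vocabulary."""
    tokens = []
    start = 0
    while start < len(word):
        # Find the longest vocabulary entry that matches at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matches: emit an unknown token and move on.
            tokens.append("[UNK]")
            start += 1
    return tokens

print(subword_tokenize("unhappiness"))  # -> ['un', 'happi', 'ness']
print(subword_tokenize("cats"))         # -> ['cat', 's']
```

Because "unhappiness" is not in the vocabulary as a whole word, the tokenizer falls back to smaller pieces it does know, which is exactly how out-of-vocabulary words are handled.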
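Steps 4 and 5 can be pictured as two lookups: each token is mapped to an integer ID from the vocabulary, and those IDs index into an embedding table whose rows are the vectors the model actually consumes. The table below is randomly initialized for illustration only; in a real LLM it is learned during training, and the embedding dimension is much larger.

```python
import numpy as np

# Hypothetical vocabulary and embedding table, for illustration only.
vocab = ["[UNK]", "un", "happi", "ness", "i", "love", "cat", "s"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
embedding_dim = 8  # real models use hundreds or thousands of dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def encode(tokens: list[str]) -> np.ndarray:
    """Map subword tokens to integer IDs, then look up their embedding vectors."""
    ids = [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]
    return embedding_table[ids]  # shape: (len(tokens), embedding_dim)

vectors = encode(["un", "happi", "ness"])
print(vectors.shape)  # -> (3, 8)
# This sequence of vectors is what gets fed into the LLM (step 5).
```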

In conclusion, the goal of tokenization is to represent text in a format that the model can understand and work with effectively while preserving the contextual meaning of the text.