<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Gen Ai Cohort blogs]]></title><description><![CDATA[Gen Ai Cohort blogs]]></description><link>https://gen-ai-blogs.imvkc.in</link><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 10:18:57 GMT</lastBuildDate><atom:link href="https://gen-ai-blogs.imvkc.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[RAG Advanced Patterns and Pipelines.]]></title><description><![CDATA[Before we explore advanced RAG patterns and pipelines, let’s quickly revisit what RAG is, why it matters, and what makes it so widely discussed.
Retrieval-Augmented Generation (RAG) is a technique that pairs large language models (LLMs) with a retrie...]]></description><link>https://gen-ai-blogs.imvkc.in/rag-advanced-patterns-and-pipelines</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/rag-advanced-patterns-and-pipelines</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[advanced rag]]></category><category><![CDATA[generative ai]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Sat, 30 Aug 2025 10:17:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756548963474/385f5aff-e78c-4737-a240-1a135d42c34e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before we explore <strong>advanced RAG patterns and pipelines</strong>, let’s quickly revisit what RAG is, why it matters, and what makes it so widely discussed.</p>
<p><strong>Retrieval-Augmented Generation (RAG)</strong> is a technique that pairs large language models (LLMs) with a retrieval system. While LLMs are powerful, they’re limited to the data they were trained on and can’t inherently access your <strong>custom or domain-specific information</strong>. RAG bridges that gap by pulling relevant information from your own data sources and providing it to the model, enabling <strong>more precise and context-aware answers</strong>.</p>
<h3 id="heading-a-quick-look-at-how-rag-works">A Quick Look at How RAG Works</h3>
<ol>
<li><p><strong>Indexing:</strong> Your data is first divided into smaller sections (chunks) and transformed into vector embeddings — mathematical representations of text. These embeddings are stored in a <strong>vector database</strong> for efficient searching.</p>
</li>
<li><p><strong>Query Embedding:</strong> When a user submits a question, the query is also converted into an embedding, allowing the system to compare it against the stored data.</p>
</li>
<li><p><strong>Retrieval and Response Generation:</strong> The system finds the most relevant chunks based on the query, retrieves them, and feeds them to the LLM. The model then uses this information to craft a <strong>tailored, context-rich response</strong>.</p>
</li>
</ol>
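<p>To make these three steps concrete, here is a minimal, self-contained sketch in JavaScript. The "embedding" here is a toy bag-of-words vector over the corpus vocabulary, a stand-in for a real embedding model, but the indexing, query-embedding, and retrieval flow is the same.</p>

```javascript
// Toy tokenizer: lowercase words only.
const tokenize = (text) => text.toLowerCase().split(/\W+/).filter(Boolean);

// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class TinyVectorStore {
  // 1. Indexing: embed every chunk and keep the vectors.
  constructor(chunks) {
    this.vocab = [...new Set(chunks.flatMap(tokenize))];
    this.entries = chunks.map((text) => ({ text, vector: this.embed(text) }));
  }
  // Toy "embedding": word counts over the fixed vocabulary.
  embed(text) {
    const tokens = tokenize(text);
    return this.vocab.map((w) => tokens.filter((t) => t === w).length);
  }
  // 2. Query embedding + 3. Retrieval: embed the query, rank chunks by similarity.
  retrieve(query, k = 2) {
    const q = this.embed(query);
    return this.entries
      .map((e) => ({ text: e.text, score: cosine(q, e.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}

const store = new TinyVectorStore([
  "The cat sat on the mat.",
  "Dogs chase the ball in the park.",
]);
console.log(store.retrieve("where did the cat sit?", 1)[0].text);
// prints "The cat sat on the mat."
```

<p>In a real pipeline the embedding comes from a model and the store is a vector database, but the shape of the code is the same.</p>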
<h2 id="heading-why-we-need-advanced-patterns-and-pipelines"><strong>Why we need advanced patterns and pipelines:</strong></h2>
<p>We encountered several issues with RAG models, which led us to the need for advanced patterns and pipelines. These issues included:</p>
<p><strong>1. Poor Input, Poor Output:</strong><br />An LLM can only work with the information it’s given. If a query is unclear, nonsensical, or riddled with spelling mistakes (e.g., “asasasaass”), the system may produce equally poor results. Ambiguous prompts lead to vague or irrelevant answers.</p>
<p><strong>2. Context Window Limitations:</strong><br />Every language model has a fixed <strong>context window</strong> — the maximum amount of text it can process at once. If the query and retrieved documents exceed this limit, the model may miss important details or fail to give a well-rounded response.</p>
<p><strong>3. Gaps in Semantic Search:</strong><br />Even advanced vector search methods aren’t perfect. They may overlook subtle yet important pieces of information, especially when the dataset is large or the query is nuanced. This can lead to answers that are technically correct but incomplete.</p>
<p><strong>4. Challenges in Domain Reasoning:</strong><br />Words can take on very different meanings across industries — “bond” in finance vs. chemistry, for example. Without deep domain-specific knowledge, LLMs may misinterpret terms and deliver overly simplified or incorrect conclusions.</p>
<p><strong>5. Hallucinations:</strong><br />One of the biggest challenges in AI today is <strong>hallucination</strong> — when a model confidently generates content that isn’t accurate. This often happens when the LLM doesn’t have enough context, causing it to “fill in the blanks” with made-up details.</p>
<h2 id="heading-advanced-patterns-and-pipelines"><strong>Advanced patterns and pipelines:</strong></h2>
<p>To address these limitations, we use <strong>advanced RAG patterns and pipelines</strong> designed to improve retrieval precision and output quality:</p>
<p><strong>1. Query Rewriting:</strong><br />LLMs can rephrase and refine user queries to make them clearer and more precise, helping the RAG system better interpret intent and return more relevant data.</p>
<p><strong>2. Multi-Query Retrieval:</strong><br />Instead of relying on a single query, this method creates multiple variations of the same question. Documents are retrieved for each variation, scored, and ranked, ensuring the system surfaces the most relevant and comprehensive information.</p>
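<p>The scoring-and-ranking step above is often implemented with reciprocal rank fusion (RRF). A small sketch, assuming a <code>retrieveFn</code> callback that returns an ordered list of document ids for one query variant:</p>

```javascript
// Multi-query retrieval: retrieve once per query variant, then fuse the
// per-variant rankings with reciprocal rank fusion (RRF). retrieveFn is an
// assumed callback; 60 is the conventional RRF smoothing constant.
function multiQueryRetrieve(variants, retrieveFn, k = 3) {
  const scores = new Map();
  for (const query of variants) {
    retrieveFn(query).forEach((docId, rank) => {
      // Documents that appear (and rank high) across many variants accumulate score.
      scores.set(docId, (scores.get(docId) || 0) + 1 / (60 + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([docId]) => docId);
}
```

<p>A document retrieved by several variants outranks one found by only a single variant, which is exactly the "comprehensive" behavior this pattern is after.</p>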
<p><strong>3. HyDE (Hypothetical Document Embedding):</strong><br />Rather than searching directly with a raw query, HyDE generates a <strong>hypothetical answer</strong> first. This answer is embedded and used to search the vector database, often producing deeper, context-rich results.</p>
<p><strong>4. Re-Ranking with Cross-Encoders:</strong><br />After retrieving initial results, a cross-encoder model can re-rank them based on semantic relevance, further improving accuracy. This extra layer of ranking helps prioritize the <em>most</em> useful information for the LLM.</p>
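<p>As a sketch of this two-stage pattern, here the cross-encoder is mocked with a simple word-overlap scorer; <code>overlapScore</code> is a placeholder for a real model that scores the (query, document) pair jointly:</p>

```javascript
// Stage 2 of retrieve-then-rerank: rescore the candidates from the first-pass
// vector search and keep the top k. scoreFn stands in for a cross-encoder.
function rerank(query, candidates, scoreFn, k = 3) {
  return candidates
    .map((doc) => ({ doc, score: scoreFn(query, doc) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.doc);
}

// Placeholder scorer: counts query words that appear in the document.
const overlapScore = (query, doc) => {
  const docWords = new Set(doc.toLowerCase().split(/\W+/));
  return query.toLowerCase().split(/\W+/).filter((w) => docWords.has(w)).length;
};
```
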
<p><strong>5. Iterative Retrieval (Feedback Loops):</strong><br />Some pipelines use a loop where the LLM analyzes initial results, refines the query, and searches again. This iterative approach can help uncover data that might have been missed in the first pass.</p>
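<p>A minimal sketch of such a feedback loop, where <code>judge</code> and <code>refine</code> are hypothetical stand-ins for LLM calls that decide whether the retrieved context is sufficient and rewrite the query if it is not:</p>

```javascript
// Iterative retrieval: retrieve, let the "LLM" judge sufficiency, refine the
// query, and try again, up to maxRounds passes.
function iterativeRetrieve(query, { retrieve, judge, refine }, maxRounds = 3) {
  let currentQuery = query;
  let results = [];
  for (let round = 0; round < maxRounds; round++) {
    results = retrieve(currentQuery);
    if (judge(currentQuery, results)) break; // enough context found, stop looping
    currentQuery = refine(currentQuery, results); // rewrite the query and retry
  }
  return results;
}
```
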
<p><img src="https://miro.medium.com/v2/resize:fit:1026/1*IPxZgGY1dlyTmWtbNJpisg.png" alt="Advanced Retrieval Augmented Generation (RAG) Techniques | by Sepehr (Sep)  Keykhaie | GoPenAI" /></p>
<p>In conclusion: advanced RAG patterns and pipelines are pushing retrieval-augmented systems far beyond their basic form, making them smarter, more context-aware, and capable of handling complex, domain-specific use cases. By combining techniques like query rewriting, multi-query retrieval, HyDE, re-ranking, and iterative feedback loops, we can significantly boost both accuracy and reliability.</p>
<p>As LLMs continue to evolve, so will RAG — with trends like agent-based retrieval, dynamic context management, and multimodal search on the horizon. For now, experimenting with these advanced strategies is the best way to build RAG systems that feel less like static tools and more like powerful, adaptive assistants.</p>
]]></content:encoded></item><item><title><![CDATA[Where RAG Fails?]]></title><description><![CDATA[Before diving into where RAG fails, let’s first understand what it is, why it’s important, and why there’s so much hype around it.
RAG stands for Retrieval-Augmented Generation. In simple terms, it combines large language models (LLMs) with a retriev...]]></description><link>https://gen-ai-blogs.imvkc.in/where-rag-fails</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/where-rag-fails</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[common rag failures]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Sat, 30 Aug 2025 09:33:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756546321741/8e66f8fa-547a-4f63-9941-b8d56889d95a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before diving into where RAG fails, let’s first understand what it is, why it’s important, and why there’s so much hype around it.</p>
<p>RAG stands for <strong>Retrieval-Augmented Generation</strong>. In simple terms, it combines large language models (LLMs) with a retrieval system. While LLMs are trained on vast amounts of general data, they often struggle when you need them to work with <strong>your own private or domain-specific data</strong>. This is where RAG comes in — it retrieves relevant information from your data sources and feeds it into the LLM, enabling it to generate more accurate and context-aware responses.</p>
<h2 id="heading-how-rag-works"><strong>How RAG Works:</strong></h2>
<p>RAG operates in <strong>three main steps</strong>:</p>
<ol>
<li><p><strong>Indexing:</strong> First, you provide your data to the system. The data is broken down into smaller, manageable chunks, and then each chunk is converted into vectors (numerical representations). These vectors are stored in a vector database for quick retrieval later.</p>
</li>
<li><p><strong>Query Vectorization:</strong> When a user submits a query, the system also breaks it down and converts it into a vector representation, making it easier to compare with the stored data.</p>
</li>
<li><p><strong>Retrieval &amp; Generation:</strong> The system searches the vector database for the most relevant chunks (usually two or three) that match the query. These chunks are then passed to the LLM, which uses them to generate a context-aware response for the user.</p>
</li>
</ol>
<p><img src="https://vectorize.io/_next/image?url=https%3A%2F%2Fmlrwd9rnffxq.i.optimole.com%2Fcb%3A641c.2be21%2Fw%3Aauto%2Fh%3Aauto%2Fq%3A90%2Ff%3Abest%2Fsm%3A0%2Fhttps%3A%2F%2Fblog.vectorize.io%2Fwp-content%2Fuploads%2F2024%2F09%2FAgentic-RAG-1.png&amp;w=3840&amp;q=75" alt="Vectorize" /></p>
<h2 id="heading-core-rag-limitations"><strong>Core RAG Limitations:</strong></h2>
<p>Let’s now discuss the core limitations of RAG:</p>
<ul>
<li><p><strong>Garbage In, Garbage Out:</strong> If a user gives an LLM a poorly structured or nonsensical query — like random text (“asasasaass”) or a query filled with spelling and grammar mistakes — the model is likely to produce poor-quality or meaningless responses. An unclear question often leads to a vague answer.</p>
</li>
<li><p><strong>Limited Context Window:</strong> Every LLM has a maximum context window (the amount of text it can process at once). If a query, along with retrieved information, exceeds this limit, the model may not fully understand the question or provide a complete, contextually accurate response.</p>
</li>
<li><p><strong>Semantic Search Gaps:</strong> Vector search can miss information that is relevant to a query, especially when the dataset is large or the query is nuanced. The model then answers from incomplete context, even though the missing details exist in the data.</p>
</li>
<li><p><strong>Domain-Specific Reasoning:</strong> LLMs may struggle to reason in specialized fields. In different domains, the same word can carry different meanings. Without deep domain understanding, the model may misinterpret context or provide oversimplified answers.</p>
</li>
<li><p><strong>Hallucination:</strong> Many LLMs occasionally “hallucinate,” meaning they generate confident but incorrect information. For example, if a user asks about a topic and the model lacks sufficient context, it might invent details or rely on irrelevant information from earlier parts of the conversation.</p>
</li>
</ul>
<h2 id="heading-solutions-to-conquer-these-failures"><strong>Solutions to conquer these failures:</strong></h2>
<ul>
<li><p><strong>Query Rewriting:</strong> Using LLMs to rewrite user queries improves their clarity and precision, helping the RAG system retrieve more relevant data.</p>
</li>
<li><p><strong>Multi-Query Retrieval:</strong> Instead of relying on a single rewritten query, we can generate multiple query variations using LLMs. The system retrieves documents for each variation, then ranks the results by relevance, ultimately providing the user with the <strong>highest-ranked, most relevant information</strong>.</p>
</li>
<li><p><strong>HyDE (Hypothetical Document Embedding):</strong> Rather than searching the vector database directly with the user’s original query, we first ask the LLM to generate a <strong>hypothetical answer</strong>. We then embed this hypothetical response and use it to search the database. This often yields <strong>more accurate and contextually relevant results</strong>.</p>
</li>
</ul>
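<p>The HyDE flow above can be sketched as a small function, where <code>generateAnswer</code>, <code>embed</code>, and <code>vectorSearch</code> are hypothetical stand-ins for the LLM call, the embedding model, and the vector-database query, passed in as dependencies:</p>

```javascript
// HyDE: embed a hypothetical answer instead of the raw query, then search
// the vector database with that embedding.
async function hydeSearch(query, { generateAnswer, embed, vectorSearch }) {
  const hypothetical = await generateAnswer(query); // 1. LLM writes a hypothetical answer
  const vector = await embed(hypothetical);         // 2. embed the answer, not the query
  return vectorSearch(vector);                      // 3. search with that embedding
}
```

<p>Because the hypothetical answer is phrased like the documents you indexed, its embedding often lands closer to the relevant chunks than the short, question-shaped query does.</p>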
<p>RAG is a powerful technique that extends the capabilities of large language models, making them more useful for domain-specific tasks and real-world applications. But as we’ve explored, it’s not a magic solution — it comes with limitations like context window size, retrieval quality, reasoning challenges, and hallucinations. The good news is that techniques like <strong>query rewriting</strong>, <strong>multi-query retrieval</strong>, and <strong>HyDE</strong> can significantly improve its performance. The future of RAG lies in combining smarter retrieval strategies with reasoning-focused AI systems, making responses not just more accurate but also deeply contextual.</p>
]]></content:encoded></item><item><title><![CDATA[Introduction to RAGs]]></title><description><![CDATA[RAGs nowadays are very important models in the world of Large Language Models (LLMs) like ChatGPT and Gemini. They differ from these LLM models because LLMs can only generate human-like text and converse with you, but they can't provide much informat...]]></description><link>https://gen-ai-blogs.imvkc.in/introduction-to-rags</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/introduction-to-rags</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Wed, 20 Aug 2025 13:54:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755698038689/4d84749b-ccec-4347-8ed0-cffd3f9cd44e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>RAGs nowadays are very important models in the world of Large Language Models (LLMs) like ChatGPT and Gemini. They differ from these LLM models because LLMs can only generate human-like text and converse with you, but they can't provide much information about a document. In contrast, a RAG model first indexes a document you provide, allowing you to ask anything related to that document. Let's discuss this in detail.</p>
<h2 id="heading-what-is-rag"><strong>What is RAG?</strong></h2>
<p>RAG stands for Retrieval-Augmented Generation. It is an advanced AI model that combines the ability to retrieve data from a document with text generation. Most LLMs are trained on pre-existing data and generate text based on that. However, with RAG, it first indexes the document or any input you provide, allowing you to ask the model anything, and it will give you answers related to that document.</p>
<h2 id="heading-why-is-rag-used"><strong>Why is RAG used?</strong></h2>
<ul>
<li><p><strong>Access to recent information</strong>: LLMs have a cutoff date in their training data. RAG allows them to fetch the latest facts.</p>
</li>
<li><p><strong>Domain-specific knowledge</strong>: You can connect RAG to private or custom datasets, like medical records or company documents.</p>
</li>
<li><p><strong>Reduced hallucinations</strong>: LLMs sometimes make things up. With retrieval, they rely on real documents for accurate answers.</p>
</li>
<li><p><strong>Efficiency</strong>: Instead of training or fine-tuning a huge model repeatedly, RAG adds intelligence through external knowledge sources.</p>
</li>
</ul>
<h2 id="heading-how-rag-works">How Does RAG Work?</h2>
<p>RAG has two main parts:</p>
<ul>
<li><p><strong>Indexing:</strong> First, we break our document into small chunks, convert these chunks into vectors, and store them in our vector database.</p>
</li>
<li><p><strong>Retrieval &amp; Generation:</strong> When a user asks a question, the query is also vectorized, the most relevant chunks are fetched from the database, and the LLM uses them to generate the answer.</p>
</li>
</ul>
<h2 id="heading-why-perform-vectorization">Why Perform Vectorization?</h2>
<p>Computers don’t understand text like humans do. Vectorization converts text into numerical vectors that capture meaning.</p>
<ul>
<li><p>Example: “dog” and “puppy” will have similar vectors because they are semantically close.</p>
</li>
<li><p>These vectors allow the retriever to quickly find the most relevant text chunks when you ask a question.</p>
</li>
</ul>
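<p>A toy illustration of that closeness, using hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a model) and cosine similarity:</p>

```javascript
// Hand-made vectors for illustration only: "dog" and "puppy" point in nearly
// the same direction, "car" points elsewhere.
const vectors = {
  dog:   [0.9, 0.8, 0.1],
  puppy: [0.85, 0.9, 0.05],
  car:   [0.1, 0.05, 0.95],
};

// Cosine similarity: close to 1 for similar meanings, close to 0 for unrelated.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

console.log(cosine(vectors.dog, vectors.puppy) > cosine(vectors.dog, vectors.car)); // true
```
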
<h2 id="heading-why-do-rags-exist">Why Do RAGs Exist?</h2>
<p>RAG exists because training a model with <em>all possible knowledge</em> is impossible—it would be too expensive and outdated quickly. Instead, RAG gives models a dynamic way to <strong>connect to external knowledge sources</strong>, keeping them flexible, lightweight, and up-to-date.</p>
<h2 id="heading-why-perform-chunking">Why Perform Chunking?</h2>
<p>When storing large documents, it’s not efficient to search entire books or articles. <strong>Chunking</strong> breaks documents into smaller, meaningful pieces (like 200–500 words each).</p>
<ul>
<li><p>This makes retrieval faster.</p>
</li>
<li><p>Improves accuracy because the retriever can return only the most relevant passage.</p>
</li>
</ul>
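<p>A minimal word-based chunker with overlap might look like this; production pipelines often split on sentences or model tokens instead, but the idea is the same:</p>

```javascript
// Fixed-size word chunker with overlap between consecutive chunks.
// Assumes chunkSize > overlap.
function chunkWords(text, chunkSize = 200, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}
```

<p>The overlap means a sentence cut off at a chunk boundary still appears whole in the next chunk, which helps retrieval return complete passages.</p>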
<p>So I want to conclude by saying that RAG bridges the gap between regular LLMs and dynamic real-world knowledge. It lets us pull knowledge from documents, websites, your own text, and many other sources, and surface exactly the information we're looking for. These systems are currently in high demand, and many businesses want one built on their own data.</p>
]]></content:encoded></item><item><title><![CDATA[Building Thinking models with Chain-of-Thoughts(CoT).]]></title><description><![CDATA[To understand this, we need to start by understanding what Chain-of-Thoughts (CoT) is and what it does. CoT is a style of system prompting where AI breaks down a problem into several parts to achieve better results and provide users with transparency...]]></description><link>https://gen-ai-blogs.imvkc.in/building-thinking-models-with-chain-of-thoughtscot</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/building-thinking-models-with-chain-of-thoughtscot</guid><category><![CDATA[Chaiaurcode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[chainofthoughts]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Sun, 17 Aug 2025 17:19:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755451082160/32d6c41b-33fe-4d1d-a556-f2187b19d6bb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>To understand this, we need to start by understanding what Chain-of-Thoughts (CoT) is and what it does. CoT is a style of system prompting where AI breaks down a problem into several parts to achieve better results and provide users with transparency about how the AI is working. In the CoT AI model, it doesn't jump directly to the result; instead, it breaks the problem into different parts.</p>
<p>Let's look at an example of how CoT prompting works step by step:</p>
<ul>
<li>First, we tell the model its role and how it will work. The base format has three steps: START, THINK, and OUTPUT (the rules defined next add an EVALUATE check after each THINK).</li>
</ul>
<ul>
<li><pre><code class="lang-plaintext">      You are an AI assistant who works in START, THINK and OUTPUT format.
      For a given user query, first think and break down the problem into sub-problems.
      You should always keep thinking before giving the actual output.
      Also, before outputting the final result to the user, you must check once if everything is correct.
</code></pre>
</li>
<li><p>Now we define the rules for the model, specifying exactly what it must follow.</p>
<pre><code class="lang-plaintext">  Rules:
      - Strictly follow the output JSON format
      - Always follow the output sequence, that is START, THINK, EVALUATE and OUTPUT.
      - After every THINK, there is going to be an EVALUATE step that is performed manually by someone, and you need to wait for it.
      - Always perform only one step at a time and wait for the next step.
      - Always make sure to do multiple steps of thinking before giving the output.
</code></pre>
</li>
<li><p>Now, we define the output format that it should use to respond.</p>
<pre><code class="lang-plaintext">  Output JSON Format:
      { "step": "START | THINK | EVALUATE | OUTPUT", "content": "string" }
</code></pre>
</li>
<li><p>Now, finally, we provide an example to our models to help them understand better. The more examples we give, the more effective the prompt will be for us.</p>
<pre><code class="lang-plaintext">  Example:
      User: Can you solve 3 + 4 * 10 - 4 * 3
      ASSISTANT: { "step": "START", "content": "The user wants me to solve 3 + 4 * 10 - 4 * 3 maths problem" } 
      ASSISTANT: { "step": "THINK", "content": "This is typical math problem where we use BODMAS formula for calculation" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Lets breakdown the problem step by step" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "As per bodmas, first lets solve all multiplications and divisions" }
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" }  
      ASSISTANT: { "step": "THINK", "content": "So, first we need to solve 4 * 10 that is 40" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Great, now the equation looks like 3 + 40 - 4 * 3" }
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Now, I can see one more multiplication to be done that is 4 * 3 = 12" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Great, now the equation looks like 3 + 40 - 12" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "As we have done all multiplications lets do the add and subtract" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "so, 3 + 40 = 43" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "new equations look like 43 - 12 which is 31" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "great, all steps are done and final result is 31" }
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" }  
      ASSISTANT: { "step": "OUTPUT", "content": "3 + 4 * 10 - 4 * 3 = 31" }
</code></pre>
</li>
</ul>
<p>Finally, here is the complete system prompt for Chain-of-Thought prompting:</p>
<pre><code class="lang-plaintext">const SYSTEM_PROMPT = `
    You are an AI assistant who works in START, THINK and OUTPUT format.
    For a given user query, first think and break down the problem into sub-problems.
    You should always keep thinking before giving the actual output.
    Also, before outputting the final result to the user, you must check once if everything is correct.

    Rules:
    - Strictly follow the output JSON format
    - Always follow the output sequence, that is START, THINK, EVALUATE and OUTPUT.
    - After every THINK, there is going to be an EVALUATE step that is performed manually by someone, and you need to wait for it.
    - Always perform only one step at a time and wait for the next step.
    - Always make sure to do multiple steps of thinking before giving the output.

    Output JSON Format:
    { "step": "START | THINK | EVALUATE | OUTPUT", "content": "string" }

    Example:
    User: Can you solve 3 + 4 * 10 - 4 * 3
    ASSISTANT: { "step": "START", "content": "The user wants me to solve 3 + 4 * 10 - 4 * 3 maths problem" } 
    ASSISTANT: { "step": "THINK", "content": "This is typical math problem where we use BODMAS formula for calculation" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "Lets breakdown the problem step by step" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "As per bodmas, first lets solve all multiplications and divisions" }
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" }  
    ASSISTANT: { "step": "THINK", "content": "So, first we need to solve 4 * 10 that is 40" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "Great, now the equation looks like 3 + 40 - 4 * 3" }
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "Now, I can see one more multiplication to be done that is 4 * 3 = 12" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "Great, now the equation looks like 3 + 40 - 12" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "As we have done all multiplications lets do the add and subtract" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "so, 3 + 40 = 43" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "new equations look like 43 - 12 which is 31" } 
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
    ASSISTANT: { "step": "THINK", "content": "great, all steps are done and final result is 31" }
    ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" }  
    ASSISTANT: { "step": "OUTPUT", "content": "3 + 4 * 10 - 4 * 3 = 31" } 
  `;
</code></pre>
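<p>To show the control flow this prompt implies, here is a tiny driver sketch that walks a canned transcript of step objects instead of calling a model; each reply is one JSON object in the format defined above:</p>

```javascript
// Walk a list of raw JSON replies, log intermediate steps, and return the
// OUTPUT content. A real loop would request one step at a time from the model.
function runSteps(replies) {
  for (const raw of replies) {
    const step = JSON.parse(raw);
    if (step.step === "THINK") console.log("🧠", step.content);
    if (step.step === "EVALUATE") console.log("✅", step.content);
    if (step.step === "OUTPUT") return step.content; // final answer: stop the loop
  }
  return null;
}
```
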
]]></content:encoded></item><item><title><![CDATA[Explaining What is Agentic AI: Agents &Tools.]]></title><description><![CDATA[Agentic AI differs significantly from traditional AI, as it has a specific goal and purpose to achieve. It can't do anything beyond its objective; it is mainly designed to fulfill its goal. In contrast, traditional AI models like ChatGPT and Gemini a...]]></description><link>https://gen-ai-blogs.imvkc.in/explaining-what-is-agentic-ai-agents-andtools</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/explaining-what-is-agentic-ai-agents-andtools</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[Chaiaurcode]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Sun, 17 Aug 2025 16:43:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755448963531/7317a1aa-a89f-479e-a362-c492402aa160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Agentic AI differs significantly from traditional AI, as it has a specific goal and purpose to achieve. It can't do anything beyond its objective; it is mainly designed to fulfill its goal. In contrast, traditional AI models like ChatGPT and Gemini are more flexible. You can't limit them to a single topic, or you can set their boundaries, but with agentic AI, we can define these boundaries.</p>
<p>Now, let's move on to the second part: what are agents and tools in agentic AI? An agent in agentic AI is simply the AI system itself, where it maintains its environment, makes decisions, and takes actions to achieve its goal. Agents are the ones who do all the work. Tools are what agents use to perform their tasks. In technical terms, they are either API calls or function calls that help the agent achieve its goal.  </p>
<p>Let's now understand why, in today's world, agentic AI models are often more useful than traditional AI models. Most of us know that traditional AI models work on pre-trained data, which might be outdated. So, if I want real-time data, a traditional model will mostly provide old information that isn't helpful. This is where agentic AI comes in with its agents and tools. Suppose we want a real-time weather update for our area. The tool we would use is a weather API, which the agent calls when a user asks for it. Using this tool, the agent can provide a real-time weather update.</p>
<p>Now, let's look at an example where we want a weather update from our Agentic AI by following these steps:</p>
<ul>
<li><p>For the first step, we need a tool to assist our agent. In this case, we will use a weather API.</p>
<pre><code class="lang-plaintext">  import axios from "axios";

  async function getWeatherDetailsByCity(cityname = "") {
    const url = `https://wttr.in/${cityname.toLowerCase()}?format=%C+%t`;
    const { data } = await axios.get(url, { responseType: "text" });
    return `The current weather of ${cityname} is ${data}`;
  }
</code></pre>
</li>
<li><p>Now we need a map to store all our tools.</p>
<pre><code class="lang-plaintext">  const TOOL_MAP = {
    getWeatherDetailsByCity: getWeatherDetailsByCity,
  };
</code></pre>
</li>
<li><p>Now, we will define our <code>SYSTEM_PROMPT</code>, which we will provide to our model to specify its role in calling the tool and presenting the result to the user in a clear and friendly manner.</p>
<pre><code class="lang-plaintext">  const SYSTEM_PROMPT = `
      You are an AI assistant who works in START, THINK and OUTPUT format.
      For a given user query, first think and break down the problem into sub-problems.
      You should always keep thinking before giving the actual output.

      Also, before outputting the final result to the user, you must check once if everything is correct.
      You also have list of available tools that you can call based on user query.

      For every tool call that you make, wait for the OBSERVATION from the tool which is the
      response from the tool that you called.

      Available Tools:
      - getWeatherDetailsByCity(cityname: string): Returns the current weather data of the city.

      Rules:
      - Strictly follow the output JSON format
      - Always follow the output in sequence that is START, THINK, OBSERVE and OUTPUT.
      - Always perform only one step at a time and wait for the next step.
      - Always make sure to do multiple steps of thinking before giving the output.
      - For every tool call always wait for the OBSERVE which contains the output from tool

      Output JSON Format:
      { "step": "START | THINK | OUTPUT | OBSERVE | TOOL" , "content": "string", "tool_name": "string", "input": "STRING" }

      Example:
      User: Hey, can you tell me weather of Patiala?
      ASSISTANT: { "step": "START", "content": "The user is interested in the current weather details about Patiala" } 
      ASSISTANT: { "step": "THINK", "content": "Let me see if there is any available tool for this query" } 
      ASSISTANT: { "step": "THINK", "content": "I see that there is a tool available getWeatherDetailsByCity which returns current weather data" } 
      ASSISTANT: { "step": "THINK", "content": "I need to call getWeatherDetailsByCity for city patiala to get weather details" }
      ASSISTANT: { "step": "TOOL", "input": "patiala", "tool_name": "getWeatherDetailsByCity" }
      DEVELOPER: { "step": "OBSERVE", "content": "The weather of patiala is cloudy with 27 Cel" }
      ASSISTANT: { "step": "THINK", "content": "Great, I got the weather details of Patiala" }
      ASSISTANT: { "step": "OUTPUT", "content": "The weather in Patiala is 27 C with little cloud. Please make sure to carry an umbrella with you. ☔️" }
    `;
</code></pre>
</li>
<li><p>Now, we will ask our model to perform the task.</p>
<pre><code class="lang-plaintext">  async function main() {

    const client = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
    });

    const messages = [
      {
        role: "system",
        content: SYSTEM_PROMPT,
      },
      {
        role: "user",
        content:
          "Hey can you clone https://code.visualstudio.com/ in drive D inside vsCode-clone folder?",
      },
    ];

    while (true) {
      const response = await client.chat.completions.create({
        model: "gpt-4.1-mini",
        messages: messages,
      });

      const rawContent = response.choices[0].message.content;
      const parsedContent = JSON.parse(rawContent);

      messages.push({
        role: "assistant",
        content: JSON.stringify(parsedContent),
      });

      if (parsedContent.step === "START") {
        console.log(`🔥`, parsedContent.content);
        continue;
      }

      if (parsedContent.step === "THINK") {
        console.log(`\t🧠`, parsedContent.content);
        continue;
      }

      if (parsedContent.step === "TOOL") {
        console.log(parsedContent);
        const toolToCall = parsedContent.tool_name;
        if (!TOOL_MAP[toolToCall]) {
          messages.push({
            role: "developer",
            content: `There is no such tool as ${toolToCall}`,
          });
          continue;
        }

        const responseFromTool = await TOOL_MAP[toolToCall](parsedContent.input);
        console.log(
          `🛠️: ${toolToCall}(${parsedContent.input}) = `,
          responseFromTool
        );
        messages.push({
          role: "developer",
          content: JSON.stringify({ step: "OBSERVE", content: responseFromTool }),
        });
        continue;
      }

      if (parsedContent.step === "OUTPUT") {
        console.log(`🤖`, parsedContent.content);
        break;
      }
    }

    console.log("Done...");
  }

  main();
</code></pre>
</li>
</ul>
<p>This was my explanation of Agentic AI, its agents, and tools. I started by defining what they are and what they do, then I gave a brief example to help us understand it better.</p>
]]></content:encoded></item><item><title><![CDATA[System Prompts and Prompting Styles.]]></title><description><![CDATA[When we write to any of our AI models, it's important to be very specific about our issue because AI models are just machines that respond accurately only when our messages are precise. The inputs we provide are technically known as prompts.
What are...]]></description><link>https://gen-ai-blogs.imvkc.in/system-prompts-and-prompting-styles</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/system-prompts-and-prompting-styles</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[Chaiaurcode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[Prompt]]></category><category><![CDATA[system-prompt]]></category><category><![CDATA[GenAI Cohort]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Fri, 15 Aug 2025 14:33:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755268347789/4587f285-63af-4d93-9210-745944be861b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we write to any of our AI models, it's important to be very specific about our issue because AI models are just machines that respond accurately only when our messages are precise. The inputs we provide are technically known as prompts.</p>
<p>What are prompting styles? They are simply different ways of writing a prompt. Each AI model has its own prompting style. Below are a few popular prompting styles:</p>
<ul>
<li><p><strong>Alpaca Prompt:</strong> ### Instruction: (enter your instruction here) ### Input: (additional context for the model) ### Response: (the model's response)</p>
</li>
<li><p><strong>LLaMA-2:</strong> [INST] Here comes your instruction [/INST]</p>
</li>
<li><p><strong>FLAN-T5:</strong> “Question: Then here is your instruction.”</p>
</li>
</ul>
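<p>To make these layouts concrete, here is a tiny sketch in JavaScript (the function name and its fields are my own, purely for illustration, not any model's official API) that assembles a prompt string in the Alpaca format:</p>

```javascript
// Hypothetical helper that lays out an instruction in the Alpaca
// prompt format described above. Only the section markers are fixed;
// the field contents are whatever you want to send to the model.
function toAlpacaPrompt(instruction, input = "") {
  return [
    "### Instruction:",
    instruction,
    "### Input:",
    input,
    "### Response:",
  ].join("\n");
}

// Example: format a small translation task.
console.log(toAlpacaPrompt("Translate to French.", "Good morning"));
```

<p>The same idea applies to the other styles: each one is just a fixed text layout that the model was trained to expect.</p>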
<p>These are some prompting styles used by different models like OpenAI, Gemini, etc.</p>
<p>Now, System Prompting is a style where we guide our model's behavior by defining its role, setting boundaries, and maintaining its role throughout the conversation. There are several types of System Prompting:</p>
<ul>
<li><strong>Zero-Shot Prompting:</strong> Here, we give the model a direct task or question without any prior examples or questions.</li>
</ul>
<ul>
<li><pre><code class="lang-plaintext">  { role: 'user', content: 'What is my name?' },
</code></pre>
</li>
<li><p><strong>Few-Shot Prompting:</strong> In this method, we provide our model with some questions and examples to establish its behavior and role.</p>
<ul>
<li>Providing examples and questions is very important, so our model behaves exactly as we want.</li>
</ul>
</li>
</ul>
<ul>
<li><pre><code class="lang-plaintext">  { role: 'system', content: ` You are an AI assistant and an expert in coding with Javascript. Javascript is the only coding language you know. If the user asks anything other than a Javascript coding question, do not answer that question. You are an AI from ChaiCode, an EdTech company transforming modern tech knowledge. Your name is ChaiCode, and always answer as if you represent ChaiCode

  Examples: Q: Hey There A: Hey, Nice to meet you. How can I help you today? Do you want me to show what we are cooking at ChaiCode.

  Q: Hey, I want to learn Javascript A: Sure, Why don't you visit our website or YouTube at chaicode for more info.

  Q: I am bored A: What about a JS Quiz?

  Q: Can you write code in Python? A: I can, but I am designed to help in JS `, }
</code></pre>
</li>
</ul>
<ul>
<li><p><strong>Chain-of-Thought Prompting:</strong> In this method, the model is encouraged to break down its reasoning step by step before arriving at an answer.</p>
<ul>
<li>Here, we also define the behavior and questions for our model, but it will be more detailed and different from few-shot prompting.</li>
</ul>
</li>
</ul>
<ul>
<li><pre><code class="lang-plaintext">      You are an AI assistant who works in START, THINK and OUTPUT format.
      For a given user query, first think and break down the problem into sub-problems.
      You should always keep thinking before giving the actual output.
      Also, before outputting the final result to the user, you must check once that everything is correct.

      Rules:
      - Strictly follow the output JSON format
      - Always follow the output sequence: START, THINK, EVALUATE and OUTPUT.
      - After every THINK step, there is going to be an EVALUATE step that is performed manually by someone, and you need to wait for it.
      - Always perform only one step at a time and wait for the next step.
      - Always make sure to do multiple steps of thinking before giving the output.

      Output JSON Format:
      { "step": "START | THINK | EVALUATE | OUTPUT", "content": "string" }

      Example:
      User: Can you solve 3 + 4 * 10 - 4 * 3
      ASSISTANT: { "step": "START", "content": "The user wants me to solve 3 + 4 * 10 - 4 * 3 maths problem" } 
      ASSISTANT: { "step": "THINK", "content": "This is a typical math problem where we use the BODMAS rule for calculation" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Let's break down the problem step by step" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "As per BODMAS, first let's solve all multiplications and divisions" }
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" }  
      ASSISTANT: { "step": "THINK", "content": "So, first we need to solve 4 * 10 that is 40" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Great, now the equation looks like 3 + 40 - 4 * 3" }
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Now, I can see one more multiplication to be done that is 4 * 3 = 12" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Great, now the equation looks like 3 + 40 - 12" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "As we have done all multiplications, let's do the addition and subtraction" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "so, 3 + 40 = 43" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "The new equation looks like 43 - 12, which is 31" } 
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" } 
      ASSISTANT: { "step": "THINK", "content": "Great, all steps are done and the final result is 31" }
      ASSISTANT: { "step": "EVALUATE", "content": "Alright, Going good" }  
      ASSISTANT: { "step": "OUTPUT", "content": "3 + 4 * 10 - 4 * 3 = 31" }
</code></pre>
</li>
</ul>
<ul>
<li><p><strong>Self-Consistency Prompting:</strong> In this approach, we give our query to two different AI models. Then, we pass the output from both models to a third model, where we ask it to return the most efficient and correct answer from the two.</p>
</li>
<li><p><strong>Persona-Based Prompting:</strong> In this method, the model is instructed to act as a specific character or profession.</p>
<ul>
<li>To make the model behave like a specific character, we provide it with many examples and details about the character, including how it should speak and what its boundaries should be.</li>
</ul>
</li>
</ul>
<ul>
<li><pre><code class="lang-plaintext">  {
          role: 'system',
          content: `
                  You are an AI assistant who acts as XYZ(Name). You are the persona of an amazing
                  developer named XYZ(Name).

                  Characteristics of XYZ(Name)
                  - Full Name: XYZ(Name)
                  - Age: 123 Years old
                  - Date of birth: 1st Feb, 2025

                  Social Links:
                  - LinkedIn URL: 
                  - X URL: 

                  Examples of text on how XYZ(Name) typically chats or replies:
                  - Hey user, Yes
                  - This can be done.
                  - Sure, I will do this

              `,
        },
</code></pre>
</li>
</ul>
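<p>The self-consistency idea described above can be sketched as a simple majority vote. In this toy version, the candidate answers are hard-coded stand-ins for real model outputs; in practice you would collect them from different models or repeated runs before voting:</p>

```javascript
// A minimal sketch of self-consistency: collect several candidate
// answers to the same question and keep the one that appears most often.
function majorityVote(answers) {
  const counts = new Map();
  for (const answer of answers) {
    counts.set(answer, (counts.get(answer) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [answer, count] of counts) {
    if (count > bestCount) {
      best = answer;
      bestCount = count;
    }
  }
  return best;
}

// Three candidate answers to the same math question; two agree.
console.log(majorityVote(["31", "31", "29"])); // → "31"
```

<p>The blog's version uses a third model as the judge instead of a plain vote; the vote is just the simplest way to see the same principle at work.</p>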
<p>In conclusion, prompting styles and system prompts are critical for getting accurate and efficient responses from our AI models. A model that handles them well will be in higher demand; one that does not will see its usage fall.</p>
]]></content:encoded></item><item><title><![CDATA[Explaining Vector Embedding to all the moms.]]></title><description><![CDATA[In today’s modern AI time vector embedding is very import thing used in all the AI models but very less people know about this mostly non-tech people like our moms. So, Today I m going to try explaining vector embedding to our mums.
Now, let me start...]]></description><link>https://gen-ai-blogs.imvkc.in/explaining-vector-embedding-to-all-the-moms</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/explaining-vector-embedding-to-all-the-moms</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[vector embeddings]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Wed, 13 Aug 2025 10:13:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755079968659/305a13fc-c7b9-41a2-9d30-7849c72fd927.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today’s modern AI time vector embedding is very import thing used in all the AI models but very less people know about this mostly non-tech people like our moms. So, Today I m going to try explaining vector embedding to our mums.</p>
<p>Let me start by explaining what vector embedding means. In the AI world, a vector is nothing much, just a list of numbers, e.g. [1, 2.22, 334, ...]; its length depends on the embedding model. Embedding simply means converting a word or sentence into this vector form. We need this because AI is just a machine, and machines do not understand languages like Hindi or English or whatever you speak. They only understand numbers, so to make them work we first embed our sentences and words into vectors.</p>
<p>Suppose we give our machine two words, “King” and “Queen”, and ask how closely they relate. The machine first converts both words into vectors, and if the two vectors are very close, it concludes that the words are related. In this case the vectors will indeed be very close, so the result is that they relate to each other. Now take another example, “shoe” and “lion”: when we convert these into vectors, they end up very far apart, and on that basis our AI model tells us they do not relate to each other in any way.</p>
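<p>For curious readers, here is a toy sketch of the “closeness” part in JavaScript. The numbers are completely made up for illustration (real embeddings have hundreds or thousands of dimensions), but they show how a similarity score puts related words close together:</p>

```javascript
// Toy 2-dimensional "embeddings"; the values are invented.
const vectors = {
  king:  [0.9, 0.8],
  queen: [0.88, 0.82],
  shoe:  [0.1, -0.7],
};

// Cosine similarity: close to 1 means the vectors point the same way,
// values near 0 or below mean the words are unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity(vectors.king, vectors.queen)); // close to 1
console.log(cosineSimilarity(vectors.king, vectors.shoe));  // much lower
```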
<p>With this I want to conclude that vector embedding is just a technique our AI models use: they map each word onto a graph by first converting it into a vector, and words that relate to each other end up very close together on that graph. This is not a normal graph either; it has many dimensions, and each AI model has its own vector embeddings whose dimensions vary.</p>
]]></content:encoded></item><item><title><![CDATA[Explaining tokenization to a 1st year university student.]]></title><description><![CDATA[If we break down tokenization into two different parts: tokens + ization, where tokens mean a small chunk or piece of anything, and ization means converting to something, then if we combine both, tokenization means converting something into tokens or...]]></description><link>https://gen-ai-blogs.imvkc.in/explaining-tokenization-to-a-1st-year-university-student</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/explaining-tokenization-to-a-1st-year-university-student</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[genai]]></category><category><![CDATA[Tokenization]]></category><category><![CDATA[College]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Wed, 13 Aug 2025 07:00:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755065972387/51472866-3951-40da-9cbd-29c42525be74.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If we break down tokenization into two different parts: tokens + ization, where tokens mean a small chunk or piece of anything, and ization means converting to something, then if we combine both, tokenization means converting something into tokens or chunks or small pieces.  </p>
<p>Now let's discuss why we need tokenization and where it's used. In the world of growing AI models like ChatGPT, Gemini, and Perplexity, tokenization is essential. These models rely on predictions based on tokenization. Here's how these models benefit from tokenization, following a few steps:</p>
<ol>
<li><p>They break a full sentence into tokens or small pieces. This can be done in different ways, like by each word or even parts of a word.</p>
<ul>
<li><p>For example, “Hey, I love ice cream.”</p>
</li>
<li><p>When converted into tokens, it becomes [“Hey”, “,”, “I”, “love”, “Ice”, “cream”].</p>
</li>
</ul>
</li>
<li><p>Now the models will convert these tokens into numbers.</p>
<ul>
<li>The tokens ["Hey", ",", "I", "love", "Ice", "cream"] will be converted to numbers like [111, 12, 44, 5698, 986, 25896].</li>
</ul>
</li>
<li><p>Now, using these converted numbers, the models will predict the next word or, more accurately, the next token.</p>
</li>
</ol>
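<p>The steps above can be sketched in a few lines of JavaScript. The vocabulary here is invented just for this example; real models learn vocabularies with tens of thousands of tokens from data:</p>

```javascript
// A toy token-to-number pipeline: split a sentence into tokens,
// then look each token up in a (made-up) vocabulary of IDs.
const vocabulary = { "Hey": 111, ",": 12, "I": 44, "love": 5698, "Ice": 986, "cream": 25896 };

// Split on word boundaries, keeping punctuation as its own token.
function tokenize(sentence) {
  return sentence.match(/\w+|[^\s\w]/g) || [];
}

// Map each token to its ID in the vocabulary.
function encode(tokens) {
  return tokens.map((t) => vocabulary[t]);
}

const tokens = tokenize("Hey, I love Ice cream");
console.log(tokens);         // ["Hey", ",", "I", "love", "Ice", "cream"]
console.log(encode(tokens)); // [111, 12, 44, 5698, 986, 25896]
```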
<p>This is how tokenization actually works.  </p>
<p>Earlier, we learned that tokenization involves breaking a full sentence into smaller tokens. But now the question is: how do we decide how to divide the sentence? Should it be by each character, each word, or something else? The answer is that each AI model uses a different technique for tokenization. Here are a few examples:</p>
<h3 id="heading-a-word-tokenization"><strong>a) Word Tokenization</strong></h3>
<ul>
<li><p>Splitting text by spaces into whole words.</p>
</li>
<li><p>Problem: It doesn't work well for unknown words or different forms of words.</p>
</li>
<li><p><code>"playing"</code> and <code>"played"</code> are treated as completely different tokens.</p>
</li>
</ul>
<h3 id="heading-b-subword-tokenization"><strong>b) Subword Tokenization</strong></h3>
<ul>
<li><p>Break words into <strong>smaller chunks</strong> that can be combined.</p>
</li>
<li><p><code>"playing"</code> might be split into <code>"play"</code> + <code>"ing"</code>.</p>
</li>
<li><p>This helps the model understand rare words and new combinations.</p>
</li>
</ul>
<h3 id="heading-ccharacter-tokenization"><strong>c) Character Tokenization</strong></h3>
<ul>
<li><p>Each character (letter, punctuation, space) is treated as a token.</p>
</li>
<li><p><code>"cat"</code> becomes <code>"c"</code>, <code>"a"</code>, <code>"t"</code>.  </p>
</li>
</ul>
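<p>Here is a quick sketch of the three strategies side by side. The subword split is hard-coded for illustration; real subword tokenizers such as BPE or WordPiece learn their splits from training data:</p>

```javascript
// a) Word tokenization: split on spaces.
const wordTokens = "playing in the park".split(" ");
console.log(wordTokens); // ["playing", "in", "the", "park"]

// b) Subword tokenization (an illustrative rule, not a learned model).
function subwordSplit(word) {
  return word.endsWith("ing") ? [word.slice(0, -3), "ing"] : [word];
}
console.log(subwordSplit("playing")); // ["play", "ing"]
console.log(subwordSplit("cat"));     // ["cat"]

// c) Character tokenization: every character is a token.
const charTokens = [..."cat"];
console.log(charTokens); // ["c", "a", "t"]
```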
<p>Now, with this, I want to conclude that tokenization is just a technique for AI models that helps them predict the next word, or more accurately, the next token.</p>
]]></content:encoded></item><item><title><![CDATA[Explaining GPT to a kid]]></title><description><![CDATA[Today, applications like ChatGPT, Gemini, and Perplexity are becoming very popular. Airtel even partnered with Perplexity AI to offer its users a free 12-month Pro subscription, valued at around ₹17,000 per year. People of all ages are using these AI...]]></description><link>https://gen-ai-blogs.imvkc.in/explaining-gpt-to-a-kid</link><guid isPermaLink="true">https://gen-ai-blogs.imvkc.in/explaining-gpt-to-a-kid</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[chatgpt]]></category><category><![CDATA[#perplexity.ai]]></category><category><![CDATA[gemini]]></category><dc:creator><![CDATA[Vipul kant chaturvedi]]></dc:creator><pubDate>Tue, 12 Aug 2025 19:21:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755026069587/20a97702-7341-4c86-85b6-40d5c1c0229b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, applications like ChatGPT, Gemini, and Perplexity are becoming very popular. Airtel even partnered with Perplexity AI to offer its users a free 12-month Pro subscription, valued at around ₹17,000 per year. People of all ages are using these AI services, from grandparents to employees for daily tasks, and even kids for entertainment, passing time, and homework. Many users, including kids, use these tools without fully understanding what they are, how they work, or how they provide the exact information requested.</p>
<p>If we break down "GPT," it stands for Generative Pre-trained Transformer, which explains itself:</p>
<ul>
<li><p>G = generative → It means it creates something.</p>
</li>
<li><p>P = pre-trained → It's already been trained on a lot of data before you start using it. This data is provided to the model beforehand, so what it generates is based on what it has learned.</p>
</li>
<li><p>T = transformer → It takes input, processes it, and produces output. The output is completely based on the pre-trained data we already gave to the model.</p>
</li>
</ul>
<p>So, the above explanation was mostly based on its full form. We can break it down into simpler words: the job of these AI models is just to predict what comes next. First the model takes the full sentence, then it breaks the sentence into small parts; the split can be based on anything, like each word or each character. Each split piece, which we call a token, is assigned a number, and based on those numbers the model predicts the next word or, more accurately, the next token. When splitting, the AI model considers commas, spaces, and all other special characters too, and assigns them numbers as well.</p>
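<p>As a toy illustration of “predicting what comes next” (this is only a word-counting sketch, nothing like the real transformer inside GPT), we can count which word follows which in a small text and always predict the most frequent follower:</p>

```javascript
// A tiny "what comes next" predictor: count word pairs in some text,
// then predict the most frequently seen follower of a word.
const trainingText = "i love ice cream . i love ice skating . i hate rain";
const trainWords = trainingText.split(" ");

// followers["love"] = { ice: 2 } means "ice" followed "love" twice.
const followers = {};
for (let i = 0; i < trainWords.length - 1; i++) {
  const current = trainWords[i];
  const next = trainWords[i + 1];
  followers[current] = followers[current] || {};
  followers[current][next] = (followers[current][next] || 0) + 1;
}

// Predict the most frequent word seen after `word` (ties: first seen wins).
function predictNext(word) {
  const counts = followers[word];
  if (!counts) return null;
  return Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b));
}

console.log(predictNext("love")); // → "ice"
console.log(predictNext("i"));    // → "love"
```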
<p>So, this was my small explanation, for a kid, of what GPT actually is, how it works, and what it does. Thank you for giving your time to my blog.</p>
]]></content:encoded></item></channel></rss>