feat:update demo code of CLIP

2026-02-12 11:19:52 +08:00 · 2026-02-12 11:19:52 +08:00 · 5478a8618b
commit 5478a8618b
parent 4bf4aafc73
12 changed files with 50385 additions and 694 deletions
--- a/examples/clip/000000004505.jpg
+++ b/examples/clip/000000004505.jpg
--- a/examples/clip/README.md
+++ b/examples/clip/README.md
@ -1,95 +1,159 @@
-## Demo Run
-
-### CPP
-
-#### 1. Compile
-
-**Prerequisites:**
- Android NDK (r25e recommended)
- `ANDROID_NDK_PATH` environment variable set
-
-**Build:**
-```bash
-# Build for arm64-v8a
-cd examples/clip/cpp
-./build-android.sh -a arm64-v8a
-```
-
-The executable will be generated at `build/android_arm64-v8a/clip_demo` (Note: executable name may vary, verify in build folder).
-
-#### 2. Run
-
-```bash
-# Push executable to device
-adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
-adb push model/vision_model_int8_A311D2.adla /data/local/tmp/
-adb push clip_datasets/ /data/local/tmp/
-adb push test_hat_0.jpg /data/local/tmp/
-
-# Run on device
-adb shell
-cd /data/local/tmp
-chmod +x clip_demo
-export LD_LIBRARY_PATH=/vendor/lib64 or (/vendor/lib)
-
-# Usage: ./clip_demo <model_path> [base_dir] [json_filename]
-./clip_demo vision_model_int8_A311D2.adla ./clip_datasets/ clip_text_res.json
-```
-
-**Note:** 
- Replace `vision_model_int8_A311D2.adla` with your actual model file path.
- The `base_dir` and `json_filename` parameters are optional. You can also use environment variables `CLIP_BASE_DIR` and `CLIP_JSON_FILENAME`.
- The program will prompt you to enter image paths interactively. Enter "exit" to quit.
-
-### Python
-
-**Prerequisites:**
- Python 3.10
- Required packages: `numpy`, `Pillow`, `amlnnlite`
-
-**Install dependencies:**
-```bash
-pip install numpy Pillow amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
-```
-
-**Run on device:**
-```bash
-# Basic usage (process current directory)
-python clip.py --model-path ./vision_model_int8_A311D2.adla
-
-# Specify image directory or file
-python clip.py --model-path ./vision_model_int8_A311D2.adla --image-dir ./
-
-# Specify base directory and JSON filename
-python clip.py --model-path ./vision_model_int8_A311D2.adla --base-dir ./clip_datasets/ --json-filename clip_text_res.json
-```
-
-The script will automatically process all image files (`.jpg`, `.jpeg`, `.png`, `.bmp`) in the specified directory or process a single image file, and display the best matching dataset for each image.
-
-5. Results
-
-The program will print the best matching dataset path for each processed image. The program searches through all dataset folders in the base directory and finds the text feature with the highest similarity to the input image.
-
-**Example output:**
-```
-# python demo result
-Model initialized successfully.
-
-Found 2 image file(s) to process
-Searching in base directory: ./clip_datasets/
-
-Processing image: test_jacket_0.jpg
-  Best matching dataset: ./clip_datasets/shirt10_jacket7
-Searching in base directory: ./clip_datasets/
-
-Processing image: test_hat_0.jpg
-  Best matching dataset: ./clip_datasets/hat1_jd
-
-Total results: 2
-Index[0]: ./clip_datasets/shirt10_jacket7
-Index[1]: ./clip_datasets/hat1_jd
-
-Done.
-```
-
-The program returns the dataset folder path that contains the text feature with the highest similarity to the input image. Each result represents the best matching dataset for the corresponding input image.
+# CLIP
+
+## 1. Overview
+
+This demo shows how to run CLIP (Contrastive Language-Image Pre-Training) image-text matching with AMLNNLite. The CLIP model has two parts, a vision encoder and a text encoder, whose embeddings are compared to score how well each text description matches an image.
+
+## 2. Model Download
+
+TODO
+
+## 3. Model Conversion
+
+TODO
+
+## 4. Demo Run
+
+### CPP
+
+#### 1. Compile
+
+**Prerequisites:**
+- Android NDK (r25e recommended)
+- `ANDROID_NDK_PATH` environment variable set
+
+**Build:**
+```bash
+# Build for arm64-v8a
+cd examples/clip/cpp
+./build-android.sh -a arm64-v8a
+```
+
+The executable will be generated at `build/android_arm64-v8a/clip_demo`.
+
+#### 2. Run
+
+```bash
+# Push executable and resources to device
+adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
+adb push model/vision_model_int8_S905X5.adla /data/local/tmp/
+adb push model/text_model_int8_S905X5.adla /data/local/tmp/
+adb push tokenizer_path/ /data/local/tmp/
+
+# Run on device
+adb shell
+cd /data/local/tmp
+chmod +x clip_demo
+export LD_LIBRARY_PATH=/vendor/lib64  # or /vendor/lib, depending on the device
+
+# Usage: ./clip_demo <vision_model> <text_model> <tokenizer_path> [--profiling]
+./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path/
+```
+
+The program will prompt for image paths and text descriptions interactively. Enter the path to an image file, then enter comma-separated text descriptions (or `skip` to use defaults). Type `exit` to quit.
+
+**Argument Descriptions:**
+
+| Argument       | Description                                                  |
+| -------------- | ------------------------------------------------------------ |
+| vision_model   | Path to vision encoder .adla model (required)                |
+| text_model     | Path to text encoder .adla model (required)                  |
+| tokenizer_path | Path to directory containing `vocab.json` and `merges.txt` (required) |
+| --profiling    | Enable performance profiling output (optional)               |
+
+**Note:** The `tokenizer_path` should contain `vocab.json` and `merges.txt` files from the CLIP tokenizer (e.g., from `openai/clip-vit-base-patch32`).
+
+### Python
+
+**Prerequisites:**
+- Python 3.10
+- Required packages: `numpy`, `Pillow`, `transformers`, `amlnnlite`
+
+**Install dependencies:**
+```bash
+pip install numpy Pillow transformers amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
+```
+
+**Run on device:**
+```bash
+python clip.py \
+    --vision-model ./vision_model_int8_S905X5.adla \
+    --text-model ./text_model_int8_S905X5.adla \
+    --tokenizer-dir ./tokenizer_path \
+    --image-path ./000000004505.jpg \
+    --texts "a red handbag" "a blue jacket" "a red bus"
+```
+
+**Interactive Mode (Recommended):**
+
+If you don't provide `--image-path`, the program will run in interactive mode:
+
+```bash
+python clip.py \
+    --vision-model ./vision_model_int8_S905X5.adla \
+    --text-model ./text_model_int8_S905X5.adla \
+    --tokenizer-dir ./tokenizer_path
+```
+
+The program will prompt for image paths and text descriptions. Enter an image path to process, then enter comma-separated texts to compare. Type `exit` to quit.
+
+**Argument Descriptions:**
+
+| Argument         | Description                                                  |
+| ---------------- | ------------------------------------------------------------ |
+| --vision-model   | Path to vision encoder .adla model (required)                |
+| --text-model     | Path to text encoder .adla model (required)                  |
+| --tokenizer-dir  | Path to CLIPTokenizer directory (required)                   |
+| --image-path     | Path to input image (.jpg, .png) - optional, will prompt if not provided |
+| --texts          | List of text descriptions to compare (space-separated)       |
+| --max-len        | Maximum token sequence length, default is 64                 |
+| --logit-scale    | Logit scale factor, default is 100.0                         |
+
+**Note:** The `--tokenizer-dir` should point to the directory containing the CLIPTokenizer files. You can use a Hugging Face model ID (e.g., `openai/clip-vit-base-patch32`) or a local directory.
+
+## 5. Results
+
+**Performance Feedback**
+
+With the `--profiling` flag (C++) or the log level set to INFO, the program prints performance metrics when it finishes. The console log includes:
+- Hardware Information: System and ADLA library versions.
+- Model Overview: Basic input/output configurations.
+- NPU Metrics: Total inference time (latency) and total DRAM bandwidth consumption.
+
+**Interactive Mode Example:**
+
+```bash
+$ ./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path
+
+[Info] Models initialized successfully.
+
+============================================================
+[Info] Image Path (or 'exit' to quit):
+000000004505.jpg
+[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):
+a red handbag, a blue jacket, a red bus
+
+[Info] Processing image: 000000004505.jpg
+[Info] Image embedding size: 512
+[Info] Processing 3 text(s)...
+[Info] Text embeddings size: 3 x 512
+
+============================================================
+CLIP Image-Text Matching Results
+============================================================
+Image: 000000004505.jpg
+logit_scale: 100.000000
+------------------------------------------------------------
+[1] prob=0.999975  sim=0.327895  text='a red bus'
+[2] prob=0.000016  sim=0.217690  text='a red handbag'
+[3] prob=0.000008  sim=0.211029  text='a blue jacket'
+============================================================
+
+============================================================
+[Info] Image Path (or 'exit' to quit):
+exit
+[Info] Exiting...
+Free vision model memory.
+Free text model memory.
+[Info] Done.
+```
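Reviewer note on the example output above: the `prob` column looks like a softmax over `logit_scale * sim`. The Python sketch below is not part of the commit. It takes only the `sim` values and `logit_scale` printed in the log and recomputes the probabilities under that assumption. The function name `softmax` is local to this sketch and is not the C++ helper of the same name.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Values copied from the example log above (already rounded to 6 decimals).
logit_scale = 100.0
sims = {
    "a red bus": 0.327895,
    "a red handbag": 0.217690,
    "a blue jacket": 0.211029,
}

# Assumption: prob = softmax(logit_scale * sim) across the candidate texts.
probs = softmax([logit_scale * s for s in sims.values()])
for text, p in zip(sims, probs):
    print(f"prob={p:.6f}  text={text!r}")
```

Under this assumption the recomputed probabilities agree with the logged ones to within rounding. This also shows why a similarity gap of about 0.11 becomes a near-certain match once it is scaled by 100.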
--- a/examples/clip/cpp/src/CMakeLists.txt
+++ b/examples/clip/cpp/src/CMakeLists.txt
@ -1,42 +1,43 @@
-cmake_minimum_required(VERSION 3.5)
-project(clip_demo)
-
-set(CMAKE_CXX_STANDARD 17)
-
-# Set NNSDK path
-set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
-include_directories(${NNSDK_ROOT}/include)
-include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
-
-# Set 3rdparty path
-set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
-
-# Include directories for stb_image and json
-# Note: code uses #include "stb_image.h" and #include "json.hpp"
-include_directories(${3RDPARTY_DIR}/stb_image)
-include_directories(${3RDPARTY_DIR}/json)
-
-if(CMAKE_SYSTEM_NAME STREQUAL "Android")
-    if (ANDROID_ABI STREQUAL "arm64-v8a")
-        link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
-    else()
-        link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
-    endif()
-    # Android needs log
-    link_libraries(log)
-elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
-    link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
-endif()
-
-add_executable(${PROJECT_NAME}
-    main.cpp
-    model_invoke.cpp
-    pre_postprocess.cpp
-)
-
-target_link_libraries(${PROJECT_NAME}
-    nnsdk
-    dl
-    m
-)
-
+cmake_minimum_required(VERSION 3.5)
+project(clip_demo)
+
+set(CMAKE_CXX_STANDARD 17)
+
+# Set NNSDK path
+set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
+include_directories(${NNSDK_ROOT}/include)
+include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
+
+# Set 3rdparty path
+set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
+
+# Include directories for stb_image and json
+# Note: code uses #include "stb_image.h" and #include "json.hpp"
+include_directories(${3RDPARTY_DIR}/stb_image)
+include_directories(${3RDPARTY_DIR}/json)
+
+if(CMAKE_SYSTEM_NAME STREQUAL "Android")
+    if (ANDROID_ABI STREQUAL "arm64-v8a")
+        link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
+    else()
+        link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
+    endif()
+    # Android needs log
+    link_libraries(log)
+elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
+    link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
+endif()
+
+add_executable(${PROJECT_NAME}
+    main.cpp
+    model_invoke.cpp
+    pre_postprocess.cpp
+    clip_tokenizer.cpp
+)
+
+target_link_libraries(${PROJECT_NAME}
+    nnsdk
+    dl
+    m
+)
+
--- a/examples/clip/cpp/src/clip_process.h
+++ b/examples/clip/cpp/src/clip_process.h
@ -0,0 +1,53 @@
+/*
+ * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef CLIP_PROCESS_H
+#define CLIP_PROCESS_H
+
+#include <string>
+#include <vector>
+#include <cstdint>
+
+// ==================== Model Invoke ====================
+
+// Initialize network from file
+void* init_network_file(const char *model_path);
+
+// Run vision model inference
+std::vector<float> run_vision_model(void* context, const std::vector<float>& input_data);
+
+// Run text model inference
+std::vector<float> run_text_model(void* context, const std::vector<int64_t>& input_ids);
+
+// Destroy network
+int destroy_network(void *qcontext);
+
+// ==================== Pre/Post Processing ====================
+
+// Image preprocessing
+std::vector<float> preprocess_image(const std::string& image_path);
+
+// L2 normalize
+std::vector<float> l2_normalize(const std::vector<float>& vec);
+
+// Softmax
+std::vector<float> softmax(const std::vector<float>& logits);
+
+// Compute cosine similarity
+float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale = 100.0f);
+
+#endif // CLIP_PROCESS_H
+
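Reviewer note: `clip_process.h` only declares `l2_normalize` and `compute_similarity`; their bodies are not in this part of the diff. The Python sketch below illustrates what such helpers commonly compute. It assumes the similarity is the dot product of L2-normalized embeddings multiplied by `scale`; whether the C++ version applies `scale` in exactly this way is an assumption. The name `scaled_cosine_similarity` is hypothetical.

```python
import math

def l2_normalize(vec):
    """Return vec scaled to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def scaled_cosine_similarity(a, b, scale=100.0):
    """Assumed behavior: scale * cos(a, b), computed via normalized dot product."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return scale * sum(x * y for x, y in zip(a_n, b_n))

print(l2_normalize([3.0, 4.0]))                              # unit-length copy of (3, 4)
print(scaled_cosine_similarity([1.0, 0.0], [1.0, 0.0]))      # identical directions
print(scaled_cosine_similarity([1.0, 0.0], [0.0, 1.0]))      # orthogonal directions
```

Normalizing both embeddings first makes the score depend only on direction, which is why the two encoders' outputs can be compared even when their magnitudes differ.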
--- a/examples/clip/cpp/src/clip_tokenizer.cpp
+++ b/examples/clip/cpp/src/clip_tokenizer.cpp
@ -0,0 +1,395 @@
+/*
+ * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "clip_tokenizer.h"
+#include "json.hpp"
+
+#include <fstream>
+#include <sstream>
+#include <iostream>
+#include <algorithm>
+#include <regex>
+#include <set>
+#include <cassert>
+#include <climits>  // INT_MAX, used in bpe()
+#include <codecvt>
+#include <locale>
+
+using json = nlohmann::ordered_json;
+
+// Reference: https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
+
+void CLIPTokenizer::init_byte_to_unicode()
+{
+    byte_to_unicode_.clear();
+    unicode_to_byte_.clear();
+
+    // Printable ASCII ranges that map to themselves
+    // '!' (33) to '~' (126), '¡' (161) to '¬' (172), '®' (174) to 'ÿ' (255)
+    std::vector<int> bs;
+    for (int i = 33; i <= 126; ++i) bs.push_back(i);    // '!' to '~'
+    for (int i = 161; i <= 172; ++i) bs.push_back(i);   // '¡' to '¬'
+    for (int i = 174; i <= 255; ++i) bs.push_back(i);   // '®' to 'ÿ'
+
+    std::vector<int> cs(bs.begin(), bs.end());
+
+    // Map remaining bytes (0-32, 127-160, 173) to 256+
+    int n = 0;
+    for (int b = 0; b < 256; ++b) {
+        if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
+            bs.push_back(b);
+            cs.push_back(256 + n);
+            n++;
+        }
+    }
+
+    for (size_t i = 0; i < bs.size(); ++i) {
+        byte_to_unicode_[static_cast<uint8_t>(bs[i])] = static_cast<char32_t>(cs[i]);
+        unicode_to_byte_[static_cast<char32_t>(cs[i])] = static_cast<uint8_t>(bs[i]);
+    }
+}
+
+// ========== UTF-8 Helpers ==========
+
+std::vector<char32_t> CLIPTokenizer::utf8_to_codepoints(const std::string& str)
+{
+    std::vector<char32_t> result;
+    size_t i = 0;
+    while (i < str.size()) {
+        char32_t cp = 0;
+        unsigned char c = str[i];
+        int len = 0;
+        if (c < 0x80) {
+            cp = c;
+            len = 1;
+        } else if ((c & 0xE0) == 0xC0) {
+            cp = c & 0x1F;
+            len = 2;
+        } else if ((c & 0xF0) == 0xE0) {
+            cp = c & 0x0F;
+            len = 3;
+        } else if ((c & 0xF8) == 0xF0) {
+            cp = c & 0x07;
+            len = 4;
+        } else {
+            ++i;
+            continue;
+        }
+        for (int j = 1; j < len && (i + j) < str.size(); ++j) {
+            cp = (cp << 6) | (str[i + j] & 0x3F);
+        }
+        result.push_back(cp);
+        i += len;
+    }
+    return result;
+}
+
+std::string CLIPTokenizer::codepoints_to_utf8(const std::vector<char32_t>& cps)
+{
+    std::string result;
+    for (char32_t cp : cps) {
+        if (cp < 0x80) {
+            result += static_cast<char>(cp);
+        } else if (cp < 0x800) {
+            result += static_cast<char>(0xC0 | (cp >> 6));
+            result += static_cast<char>(0x80 | (cp & 0x3F));
+        } else if (cp < 0x10000) {
+            result += static_cast<char>(0xE0 | (cp >> 12));
+            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
+            result += static_cast<char>(0x80 | (cp & 0x3F));
+        } else {
+            result += static_cast<char>(0xF0 | (cp >> 18));
+            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
+            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
+            result += static_cast<char>(0x80 | (cp & 0x3F));
+        }
+    }
+    return result;
+}
+
+// ========== Load Functions ==========
+
+bool CLIPTokenizer::load(const std::string& vocab_path, const std::string& merges_path)
+{
+    init_byte_to_unicode();
+
+    // Load vocab.json
+    {
+        std::ifstream file(vocab_path);
+        if (!file.is_open()) {
+            std::cerr << "Failed to open vocab file: " << vocab_path << std::endl;
+            return false;
+        }
+
+        try {
+            json j;
+            file >> j;
+            for (auto it = j.begin(); it != j.end(); ++it) {
+                std::string token = it.key();
+                int id = it.value().get<int>();
+                token_to_id_[token] = id;
+                id_to_token_[id] = token;
+            }
+        } catch (const std::exception& e) {
+            std::cerr << "Error parsing vocab.json: " << e.what() << std::endl;
+            return false;
+        }
+    }
+
+    // Find special token IDs
+    if (token_to_id_.count("<|startoftext|>")) {
+        sot_token_id_ = token_to_id_["<|startoftext|>"];
+    }
+    if (token_to_id_.count("<|endoftext|>")) {
+        eot_token_id_ = token_to_id_["<|endoftext|>"];
+    }
+
+    // Load merges.txt
+    {
+        std::ifstream file(merges_path);
+        if (!file.is_open()) {
+            std::cerr << "Failed to open merges file: " << merges_path << std::endl;
+            return false;
+        }
+
+        std::string line;
+        int rank = 0;
+
+        // Skip header line "#version: ..." if present
+        if (std::getline(file, line)) {
+            if (line.find("#version") == std::string::npos) {
+                // First line is not a header, process it
+                std::istringstream iss(line);
+                std::string a, b;
+                if (iss >> a >> b) {
+                    bpe_ranks_[{a, b}] = rank++;
+                }
+            }
+        }
+
+        while (std::getline(file, line)) {
+            if (line.empty()) continue;
+            std::istringstream iss(line);
+            std::string a, b;
+            if (iss >> a >> b) {
+                bpe_ranks_[{a, b}] = rank++;
+            }
+        }
+    }
+
+    loaded_ = true;
+    printf("[Info] CLIPTokenizer loaded: vocab_size=%zu, merges=%zu\n",
+           token_to_id_.size(), bpe_ranks_.size());
+    return true;
+}
+
+bool CLIPTokenizer::load_from_dir(const std::string& tokenizer_dir)
+{
+    std::string dir = tokenizer_dir;
+    // Ensure trailing slash
+    if (!dir.empty() && dir.back() != '/' && dir.back() != '\\') {
+        dir += "/";
+    }
+    return load(dir + "vocab.json", dir + "merges.txt");
+}
+
+// ========== BPE Implementation ==========
+
+std::string CLIPTokenizer::bytes_to_unicode_str(const std::string& raw) const
+{
+    std::vector<char32_t> result;
+    for (unsigned char c : raw) {
+        auto it = byte_to_unicode_.find(c);
+        if (it != byte_to_unicode_.end()) {
+            result.push_back(it->second);
+        }
+    }
+    return codepoints_to_utf8(result);
+}
+
+std::vector<std::string> CLIPTokenizer::bpe(const std::string& token) const
+{
+    // Convert token to individual unicode characters as strings
+    auto codepoints = utf8_to_codepoints(token);
+    if (codepoints.empty()) return {};
+
+    // Each character becomes a separate piece
+    std::vector<std::string> word;
+    for (size_t i = 0; i < codepoints.size(); ++i) {
+        std::string piece = codepoints_to_utf8({codepoints[i]});
+        // CLIP adds </w> to the last character
+        if (i == codepoints.size() - 1) {
+            piece += "</w>";
+        }
+        word.push_back(piece);
+    }
+
+    if (word.size() == 1) return word;
+
+    // Iteratively merge the adjacent pair with the lowest merge rank (highest priority)
+    while (true) {
+        if (word.size() < 2) break;
+
+        // Find the pair with the lowest rank
+        int best_rank = INT_MAX;
+        int best_idx = -1;
+
+        for (size_t i = 0; i < word.size() - 1; ++i) {
+            auto it = bpe_ranks_.find({word[i], word[i + 1]});
+            if (it != bpe_ranks_.end() && it->second < best_rank) {
+                best_rank = it->second;
+                best_idx = static_cast<int>(i);
+            }
+        }
+
+        if (best_idx == -1) break;  // No more merges possible
+
+        // Merge the pair at best_idx
+        std::string merged = word[best_idx] + word[best_idx + 1];
+        std::vector<std::string> new_word;
+        for (size_t i = 0; i < word.size(); ++i) {
+            if (static_cast<int>(i) == best_idx) {
+                new_word.push_back(merged);
+                ++i;  // Skip next element
+            } else {
+                new_word.push_back(word[i]);
+            }
+        }
+        word = new_word;
+    }
+
+    return word;
+}
+
+std::vector<std::string> CLIPTokenizer::pre_tokenize(const std::string& text) const
+{
+    // CLIP tokenizer: lowercase + basic clean + split by pattern
+    // Pattern from CLIP: <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+
+    // Simplified version for ASCII-dominant text:
+
+    std::string cleaned;
+    // Lowercase and basic whitespace normalization
+    for (char c : text) {
+        if (c >= 'A' && c <= 'Z') {
+            cleaned += static_cast<char>(c - 'A' + 'a');
+        } else {
+            cleaned += c;
+        }
+    }
+
+    // Simple tokenization: split by whitespace and punctuation
+    std::vector<std::string> words;
+    std::string current;
+
+    for (size_t i = 0; i < cleaned.size(); ++i) {
+        char c = cleaned[i];
+
+        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
+            if (!current.empty()) {
+                words.push_back(current);
+                current.clear();
+            }
+            // Whitespace only separates words; it is not emitted as a token.
+            // CLIP marks word boundaries with a trailing </w>, which bpe() appends.
+        } else {
+            // Check if punctuation should be separate token
+            bool is_alpha_or_digit = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
+            bool cur_is_alpha = !current.empty() &&
+                ((current.back() >= 'a' && current.back() <= 'z') ||
+                 (current.back() >= '0' && current.back() <= '9'));
+
+            if (!current.empty() && !is_alpha_or_digit && cur_is_alpha) {
+                // Start new token for punctuation
+                words.push_back(current);
+                current.clear();
+            } else if (!current.empty() && is_alpha_or_digit && !cur_is_alpha) {
+                words.push_back(current);
+                current.clear();
+            }
+            current += c;
+        }
+    }
+    if (!current.empty()) {
+        words.push_back(current);
+    }
+
+    return words;
+}
+
+// ========== Encode ==========
+
+std::vector<int64_t> CLIPTokenizer::encode(const std::string& text, int max_len) const
+{
+    if (!loaded_) {
+        std::cerr << "Tokenizer not loaded!" << std::endl;
+        return std::vector<int64_t>(max_len, 0);
+    }
+
+    std::vector<int64_t> tokens;
+
+    // Add start-of-text token
+    tokens.push_back(sot_token_id_);
+
+    // Pre-tokenize
+    std::vector<std::string> words = pre_tokenize(text);
+
+    // Process each word
+    for (const auto& word : words) {
+        // Convert raw bytes to unicode representation
+        std::string unicode_word = bytes_to_unicode_str(word);
+
+        // Apply BPE
+        std::vector<std::string> bpe_tokens = bpe(unicode_word);
+
+        // Look up token IDs
+        for (const auto& bt : bpe_tokens) {
+            auto it = token_to_id_.find(bt);
+            if (it != token_to_id_.end()) {
+                tokens.push_back(it->second);
+            } else {
+                // Unknown token, try without </w>
+                std::string no_ew = bt;
+                if (no_ew.size() >= 4 && no_ew.substr(no_ew.size() - 4) == "</w>") {
+                    no_ew = no_ew.substr(0, no_ew.size() - 4);
+                }
+                auto it2 = token_to_id_.find(no_ew);
+                if (it2 != token_to_id_.end()) {
+                    tokens.push_back(it2->second);
+                }
+                // else: skip unknown token
+            }
+        }
+    }
+
+    // Add end-of-text token
+    tokens.push_back(eot_token_id_);
+
+    // Truncate if necessary
+    if (static_cast<int>(tokens.size()) > max_len) {
+        tokens.resize(max_len);
+        // Ensure EOT is at the end
+        tokens.back() = eot_token_id_;
+    }
+
+    // Pad to max_len with EOT token (consistent with HuggingFace CLIPTokenizer)
+    while (static_cast<int>(tokens.size()) < max_len) {
+        tokens.push_back(eot_token_id_);
+    }
+
+    return tokens;
+}
+
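Reviewer note: the merge loop in `CLIPTokenizer::bpe` can be summarized with the short Python sketch below. It follows the same strategy as the C++ code (append `</w>` to the last character, then repeatedly merge the adjacent pair with the lowest rank, taking the leftmost one on ties). The merge table `toy_ranks` is made up for illustration and is not taken from a real `merges.txt`.

```python
def bpe(token, ranks):
    """Greedy BPE: merge the lowest-ranked adjacent pair until none is in `ranks`."""
    if not token:
        return []
    # One piece per character; the last piece carries CLIP's end-of-word marker.
    word = list(token[:-1]) + [token[-1] + "</w>"]
    while len(word) > 1:
        # (rank, index) for every adjacent pair; unknown pairs get infinite rank.
        candidates = [
            (ranks.get((word[i], word[i + 1]), float("inf")), i)
            for i in range(len(word) - 1)
        ]
        best_rank, best_idx = min(candidates)  # ties resolve to the leftmost pair
        if best_rank == float("inf"):
            break  # no remaining pair is mergeable
        word[best_idx:best_idx + 2] = [word[best_idx] + word[best_idx + 1]]
    return word

# Hypothetical merge rules: lower rank means higher priority.
toy_ranks = {("b", "u"): 0, ("bu", "s</w>"): 1}

print(bpe("bus", toy_ranks))  # both rules apply in sequence
print(bpe("bug", toy_ranks))  # only the first rule applies
```

With the toy table, `bus` collapses to a single piece while `bug` stops at two pieces, since no rule covers the pair `("bu", "g</w>")`.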
--- a/examples/clip/cpp/src/clip_tokenizer.h
+++ b/examples/clip/cpp/src/clip_tokenizer.h
@ -0,0 +1,105 @@
+/*
+ * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef CLIP_TOKENIZER_H
+#define CLIP_TOKENIZER_H
+
+#include <cstdint>  // uint8_t, int64_t
+#include <string>
+#include <vector>
+#include <map>
+#include <unordered_map>
+
+class CLIPTokenizer {
+public:
+    CLIPTokenizer() = default;
+
+    /**
+     * Load tokenizer from vocab.json and merges.txt
+     * @param vocab_path   Path to vocab.json
+     * @param merges_path  Path to merges.txt
+     * @return true on success
+     */
+    bool load(const std::string& vocab_path, const std::string& merges_path);
+
+    /**
+     * Load tokenizer from a directory containing vocab.json and merges.txt
+     * @param tokenizer_dir  Path to directory
+     * @return true on success
+     */
+    bool load_from_dir(const std::string& tokenizer_dir);
+
+    /**
+     * Tokenize text to token IDs with padding/truncation.
+     * Adds <|startoftext|> and <|endoftext|> automatically.
+     *
+     * @param text      Input text string
+     * @param max_len   Maximum sequence length (default: 64)
+     * @return Vector of int64_t token IDs with shape [max_len]
+     */
+    std::vector<int64_t> encode(const std::string& text, int max_len = 64) const;
+
+    /**
+     * Check if tokenizer is loaded
+     */
+    bool is_loaded() const { return loaded_; }
+
+    /**
+     * Get vocabulary size
+     */
+    size_t vocab_size() const { return token_to_id_.size(); }
+
+private:
+    // BPE pair
+    using BPEPair = std::pair<std::string, std::string>;
+
+    // Byte-to-unicode mapping (GPT-2 style)
+    std::unordered_map<uint8_t, char32_t> byte_to_unicode_;
+    std::unordered_map<char32_t, uint8_t> unicode_to_byte_;
+
+    // Vocabulary
+    std::unordered_map<std::string, int> token_to_id_;
+    std::unordered_map<int, std::string> id_to_token_;
+
+    // BPE merge rules (pair -> priority rank)
+    std::map<BPEPair, int> bpe_ranks_;
+
+    // Special token IDs
+    int sot_token_id_ = 49406;  // <|startoftext|>
+    int eot_token_id_ = 49407;  // <|endoftext|>
+
+    bool loaded_ = false;
+
+    // Initialize byte-to-unicode mapping
+    void init_byte_to_unicode();
+
+    // Convert UTF-8 string to vector of unicode codepoints
+    static std::vector<char32_t> utf8_to_codepoints(const std::string& str);
+
+    // Convert unicode codepoints to UTF-8 string
+    static std::string codepoints_to_utf8(const std::vector<char32_t>& cps);
+
+    // Apply BPE to a single word (already converted to unicode representation)
+    std::vector<std::string> bpe(const std::string& token) const;
+
+    // Lowercase and split text (simplified ASCII approximation of CLIP's regex pattern)
+    std::vector<std::string> pre_tokenize(const std::string& text) const;
+
+    // Convert raw bytes to unicode string using byte_to_unicode mapping
+    std::string bytes_to_unicode_str(const std::string& raw) const;
+};
+
+#endif // CLIP_TOKENIZER_H
+
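Reviewer note: the padding and truncation behavior documented for `encode` can be sketched in Python as follows. The special token IDs are the defaults from `clip_tokenizer.h`. The helper name `pad_or_truncate` and the sample body IDs are hypothetical, and the sketch assumes `max_len` is at least 2.

```python
SOT_ID = 49406  # <|startoftext|>, default from clip_tokenizer.h
EOT_ID = 49407  # <|endoftext|>, default from clip_tokenizer.h

def pad_or_truncate(body_ids, max_len=64):
    """Wrap body_ids in SOT/EOT, then force the result to exactly max_len entries."""
    tokens = [SOT_ID] + list(body_ids) + [EOT_ID]
    if len(tokens) > max_len:
        tokens = tokens[:max_len]
        tokens[-1] = EOT_ID  # keep an end-of-text marker after truncation
    # Pad short sequences with EOT, as the C++ encode() does.
    tokens += [EOT_ID] * (max_len - len(tokens))
    return tokens

print(pad_or_truncate([1, 2, 3], max_len=8))   # short input: padded with EOT
print(pad_or_truncate(range(10), max_len=4))   # long input: truncated, EOT kept last
```

Forcing a fixed length matters because the text model takes a fixed-shape input, so every prompt must produce exactly `max_len` IDs.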
--- a/examples/clip/cpp/src/main.cpp
+++ b/examples/clip/cpp/src/main.cpp
@ -15,22 +15,26 @@
 */

 #include <iostream>
+#include <fstream>
+#include <sstream>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
+#include <vector>
+#include <string>
+#include <algorithm>

-#include "model_invoke.h"
+#include "clip_process.h"
+#include "clip_tokenizer.h"

 #define BILLION 1000000000

-struct Get_Times
+struct ProfilingTimer
 {
-    uint64_t init_start_time, init_end_time, init_total_time;
-    uint64_t preProcess_start_time, preProcess_end_time, preProcess_total_time;
-    uint64_t invoke_start_time, invoke_end_time, invoke_total_time;
-    uint64_t postProcess_start_time, postProcess_end_time, postProcess_total_time;
-    uint64_t total_time;
-    std::vector<uint64_t> total_time_group;
+    uint64_t init_start, init_end;
+    uint64_t preprocess_start, preprocess_end;
+    uint64_t vision_infer_start, vision_infer_end;
+    uint64_t text_infer_start, text_infer_end;
 };

 static uint64_t get_time_count()
@ -40,70 +44,288 @@ static uint64_t get_time_count()
    return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION);
 }

+// Default text prompts for demo
+static std::vector<std::string> default_texts = {
+    "a red handbag",
+    "a blue jacket",
+    "a red bus"
+};
+
+// Parse comma-separated texts
+std::vector<std::string> parse_texts(const std::string& input)
+{
+    std::vector<std::string> result;
+    std::stringstream ss(input);
+    std::string item;
+
+    while (std::getline(ss, item, ',')) {
+        // Trim whitespace
+        size_t start = item.find_first_not_of(" \t");
+        size_t end = item.find_last_not_of(" \t");
+        if (start != std::string::npos && end != std::string::npos) {
+            result.push_back(item.substr(start, end - start + 1));
+        }
+    }
+    return result;
+}
+
+void print_usage(const char* prog_name)
+{
+    printf("Usage: %s <vision_model> <text_model> <tokenizer_dir> [--profiling]\n", prog_name);
+    printf("\n");
+    printf("Arguments:\n");
+    printf("  vision_model:   Path to vision model (.adla)\n");
+    printf("  text_model:     Path to text model (.adla)\n");
+    printf("  tokenizer_dir:  Path to directory containing vocab.json and merges.txt\n");
+    printf("  --profiling:    Enable performance profiling output (optional)\n");
+    printf("\n");
+    printf("Interactive mode:\n");
+    printf("  - Enter image path to process\n");
+    printf("  - Enter comma-separated texts to compare (or 'skip' for defaults)\n");
+    printf("  - Enter 'exit' to quit\n");
+}
+
 int main(int argc, char ** argv)
 {
-    Get_Times model_time;
-
-    std::vector<float> input_data_fir;
-    float* model_output_data;
- 
+    ProfilingTimer timer = {};
    int ret = 0;
-    int max_index = 0;
-    
-    if (argc < 2) {
-        printf("Usage: %s <model_path> [base_dir] [json_filename]\n", argv[0]);
-        printf("  model_path:   Path to the model file\n");
-        printf("  base_dir:     Base directory for clip datasets (optional, can also use CLIP_BASE_DIR env var)\n");
-        printf("  json_filename: JSON filename in each dataset folder (optional, can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)\n");
-        return -1;
-    }
-    
-    char* model_path_encoder = argv[1];
-    std::string base_dir = (argc >= 3) ? argv[2] : "";
-    std::string json_filename = (argc >= 4) ? argv[3] : "";
-    void *context_model = NULL;
+    bool profiling = false;

-    model_time.init_start_time = get_time_count();
-    context_model = init_network_file(model_path_encoder);
-    model_time.init_end_time = get_time_count();
-
-    if (context_model == NULL)
-    {
-        printf("init_network [context_model] fail.\n");
+    if (argc < 4) {
+        print_usage(argv[0]);
        return -1;
    }

-    if (getenv("GET_TIME"))
-    {
-        model_time.init_total_time = (model_time.init_end_time - model_time.init_start_time) / 1000000;
-        std::cout << "init_model_total time : " << model_time.init_total_time << "ms" << std::endl;
+    const char* vision_model_path = argv[1];
+    const char* text_model_path = argv[2];
+    const char* tokenizer_dir = argv[3];
+
+    // Check for --profiling flag
+    for (int i = 4; i < argc; ++i) {
+        if (std::string(argv[i]) == "--profiling") {
+            profiling = true;
+        }
    }

-    while (true)
-    {
-        std::string json_path;
+    const float logit_scale = 100.0f;
+    const int max_seq_len = 64;

-        printf("\nPlease enter the JPG image path (enter exit to quit):\n");
-        std::getline(std::cin, json_path);
-        if (json_path == "exit") break;
-        if (json_path.empty()) {
-            printf("The path cannot be empty.\n");
+    // Load tokenizer
+    printf("[Info] Loading tokenizer from: %s\n", tokenizer_dir);
+    CLIPTokenizer tokenizer;
+    if (!tokenizer.load_from_dir(tokenizer_dir)) {
+        printf("[Error] Failed to load tokenizer.\n");
+        return -1;
+    }
+
+    // Initialize models
+    printf("[Info] Initializing vision model: %s\n", vision_model_path);
+    timer.init_start = get_time_count();
+    void* vision_context = init_network_file(vision_model_path);
+    if (vision_context == NULL) {
+        printf("[Error] Failed to initialize vision model.\n");
+        return -1;
+    }
+
+    printf("[Info] Initializing text model: %s\n", text_model_path);
+    void* text_context = init_network_file(text_model_path);
+    if (text_context == NULL) {
+        printf("[Error] Failed to initialize text model.\n");
+        destroy_network(vision_context);
+        return -1;
+    }
+    timer.init_end = get_time_count();
+
+    if (profiling) {
+        uint64_t init_time = (timer.init_end - timer.init_start) / 1000000;
+        printf("[Profiling] Model initialization: %llums\n", (unsigned long long)init_time);
+    }
+
+    printf("[Info] Models initialized successfully.\n\n");
+
+    // Interactive loop
+    while (true) {
+        std::string image_path;
+
+        printf("============================================================\n");
+        printf("[Info] Image Path (or 'exit' to quit):\n");
+        std::getline(std::cin, image_path);
+
+        // Trim whitespace
+        size_t start = image_path.find_first_not_of(" \t\r\n");
+        size_t end = image_path.find_last_not_of(" \t\r\n");
+        if (start != std::string::npos && end != std::string::npos) {
+            image_path = image_path.substr(start, end - start + 1);
+        } else {
+            image_path.clear();
+        }
+
+        if (image_path == "exit") {
+            printf("[Info] Exiting...\n");
+            break;
+        }
+
+        if (image_path.empty()) {
+            printf("[Warning] Please enter an image path.\n");
            continue;
        }
-        std::vector<std::string> out_str_path = process_image_dir(context_model, json_path, base_dir, json_filename);

-        for (int i = 0; i < out_str_path.size(); i++)
+        // Check if file exists
        {
-            std::cout << "Index[" << i << "] : " << out_str_path[i] << std::endl;
+            std::ifstream img_file(image_path);
+            if (!img_file.good()) {
+                printf("[Error] Image not found: %s\n", image_path.c_str());
+                continue;
+            }
        }
+
+        // Get texts to compare
+        std::vector<std::string> texts;
+
+        printf("[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):\n");
+        std::string text_input;
+        std::getline(std::cin, text_input);
+
+        // Trim
+        start = text_input.find_first_not_of(" \t\r\n");
+        end = text_input.find_last_not_of(" \t\r\n");
+        if (start != std::string::npos && end != std::string::npos) {
+            text_input = text_input.substr(start, end - start + 1);
+        } else {
+            text_input.clear();
+        }
+
+        if (text_input.empty() || text_input == "skip") {
+            texts = default_texts;
+            printf("[Info] Using default texts\n");
+        } else {
+            texts = parse_texts(text_input);
+        }
+
+        if (texts.empty()) {
+            printf("[Warning] No texts provided.\n");
+            continue;
+        }
+
+        // ==================== Process Image ====================
+        printf("\n[Info] Processing image: %s\n", image_path.c_str());
+
+        timer.preprocess_start = get_time_count();
+        std::vector<float> image_input = preprocess_image(image_path);
+        if (image_input.empty()) {
+            printf("[Error] Failed to preprocess image.\n");
+            continue;
+        }
+        timer.preprocess_end = get_time_count();
+
+        // Run vision model
+        timer.vision_infer_start = get_time_count();
+        std::vector<float> image_embedding = run_vision_model(vision_context, image_input);
+        if (image_embedding.empty()) {
+            printf("[Error] Vision model inference failed.\n");
+            continue;
+        }
+        timer.vision_infer_end = get_time_count();
+
+        // L2 normalize image embedding
+        image_embedding = l2_normalize(image_embedding);
+        printf("[Info] Image embedding size: %zu\n", image_embedding.size());
+
+        // ==================== Process Texts ====================
+        printf("[Info] Processing %zu text(s)...\n", texts.size());
+
+        std::vector<std::vector<float>> text_embeddings;
+        std::vector<uint64_t> text_infer_times;
+        timer.text_infer_start = get_time_count();
+
+        for (size_t i = 0; i < texts.size(); ++i) {
+            // Tokenize text
+            std::vector<int64_t> token_ids = tokenizer.encode(texts[i], max_seq_len);
+            // Run text model
+            uint64_t t_start = get_time_count();
+            std::vector<float> text_emb = run_text_model(text_context, token_ids);
+            uint64_t t_end = get_time_count();
+            text_infer_times.push_back((t_end - t_start) / 1000000);
+
+            if (text_emb.empty()) {
+                printf("[Error] Text model inference failed for: %s\n", texts[i].c_str());
+                continue;
+            }
+
+            // L2 normalize
+            text_emb = l2_normalize(text_emb);
+            text_embeddings.push_back(text_emb);
+        }
+
+        timer.text_infer_end = get_time_count();
+
+        if (text_embeddings.size() != texts.size()) {
+            printf("[Error] Some text embeddings failed.\n");
+            continue;
+        }
+
+        printf("[Info] Text embeddings size: %zu x %zu\n", text_embeddings.size(), 
+               text_embeddings.empty() ? 0 : text_embeddings[0].size());
+
+        // ==================== Compute Similarity ====================
+        std::vector<float> similarities(texts.size());
+        std::vector<float> logits(texts.size());
+
+        for (size_t i = 0; i < texts.size(); ++i) {
+            similarities[i] = compute_similarity(image_embedding, text_embeddings[i], 1.0f);  // cosine sim
+            logits[i] = similarities[i] * logit_scale;
+        }
+
+        // Compute probabilities
+        std::vector<float> probs = softmax(logits);
+
+        // Sort by probability (descending)
+        std::vector<size_t> indices(texts.size());
+        for (size_t i = 0; i < texts.size(); ++i) indices[i] = i;
+        std::sort(indices.begin(), indices.end(),
+            [&probs](size_t a, size_t b) { return probs[a] > probs[b]; });
+
+        // ==================== Print Results ====================
+        printf("\n============================================================\n");
+        printf("CLIP Image-Text Matching Results\n");
+        printf("============================================================\n");
+        printf("Image: %s\n", image_path.c_str());
+        printf("logit_scale: %.6f\n", logit_scale);
+        printf("------------------------------------------------------------\n");
+
+        for (size_t rank = 0; rank < indices.size(); ++rank) {
+            size_t i = indices[rank];
+            printf("[%zu] prob=%.6f  sim=%.6f  text='%s'\n",
+                rank + 1, probs[i], similarities[i], texts[i].c_str());
+        }
+        printf("============================================================\n");
+
+        if (profiling) {
+            uint64_t preprocess_time = (timer.preprocess_end - timer.preprocess_start) / 1000000;
+            uint64_t vision_time = (timer.vision_infer_end - timer.vision_infer_start) / 1000000;
+            uint64_t text_total_time = (timer.text_infer_end - timer.text_infer_start) / 1000000;
+            printf("\n[Profiling]\n");
+            printf("  Image preprocess:  %llums\n", (unsigned long long)preprocess_time);
+            printf("  Vision inference:  %llums\n", (unsigned long long)vision_time);
+            for (size_t i = 0; i < texts.size() && i < text_infer_times.size(); ++i) {
+                printf("  Text inference[%zu]: %llums  '%s'\n", i, (unsigned long long)text_infer_times[i], texts[i].c_str());
+            }
+            printf("  Text total:        %llums (%zu texts)\n", (unsigned long long)text_total_time, texts.size());
+        }
+        printf("\n");
    }

-    ret = destroy_network(context_model);
-    if (ret != 0)
-    {
-        printf("destroy_network [context_model] fail.\n");
-        return -1;
+    // Cleanup
+    ret = destroy_network(vision_context);
+    if (ret != 0) {
+        printf("[Error] Failed to destroy vision model.\n");
    }

-    return ret;
-}
+    ret = destroy_network(text_context);
+    if (ret != 0) {
+        printf("[Error] Failed to destroy text model.\n");
+    }
+
+    printf("[Info] Done.\n");
+    return 0;
+}
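
The interactive loop above repeats the same trim logic for the image path, the text input, and each comma-separated item. A minimal sketch of how that could be factored into one helper is shown below. The names `trim` and `parse_texts_sketch` are hypothetical and not part of this commit, and unlike `parse_texts` in the diff (which strips only spaces and tabs) this version also strips `\r` and `\n` by default.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not in the commit): strip leading/trailing whitespace.
static std::string trim(const std::string& s, const char* ws = " \t\r\n")
{
    const size_t start = s.find_first_not_of(ws);
    if (start == std::string::npos) return "";  // empty or all whitespace
    const size_t end = s.find_last_not_of(ws);
    return s.substr(start, end - start + 1);
}

// Split on commas, trim each item, and drop items that end up empty.
static std::vector<std::string> parse_texts_sketch(const std::string& input)
{
    std::vector<std::string> result;
    std::stringstream ss(input);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const std::string t = trim(item);
        if (!t.empty()) result.push_back(t);
    }
    return result;
}
```

With a helper like this, the two trim blocks in `main` would each reduce to a single call such as `image_path = trim(image_path);`.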
--- a/examples/clip/cpp/src/model_invoke.cpp
+++ b/examples/clip/cpp/src/model_invoke.cpp
@ -20,31 +20,20 @@
 #include <fstream>
 #include <algorithm>
 #include <vector>
+#include <cmath>
 #include <cstdlib>

-#include "model_invoke.h"
+#include "clip_process.h"
 #include "nn_sdk.h"
-#include "json.hpp"
-#include <filesystem>
-#include <regex>

-using json = nlohmann::ordered_json;
-namespace fs = std::__fs::filesystem;
+// Global DMA config for models
+static aml_memory_config_t vision_mem_config;
+static aml_memory_data_t vision_mem_data;
+static void* vision_context_flag = nullptr;

-struct DMAConfig {
-    bool use_dma = true;
-    bool malloc_buffer_once = true;
-};
-
-DMAConfig context_model;
-
-///////////////////////////////////////////////////////////
-
-aml_memory_config_t mem_config_context_model;
-aml_memory_data_t mem_data_context_model;
-
-std::vector<float> preprocess_image(const std::string& image_path);
-float post_process(const float* a, const std::vector<float>& b);
+static aml_memory_config_t text_mem_config;
+static aml_memory_data_t text_mem_data;
+static void* text_context_flag = nullptr;

 void* init_network_file(const char *model_path)
 {
@ -95,202 +84,119 @@ void* init_network_file(const char *model_path)
    return qcontext;
 }

-float* run_network(void *qcontext, std::vector<float> input_ids, const std::string image_type)
+std::vector<float> run_vision_model(void* qcontext, const std::vector<float>& input_data)
 {
    int ret = 0;
    nn_input inData;
-
    nn_output *outdata = NULL;
    aml_output_config_t outconfig;

    inData.input_index = 0;
    inData.info.input_format = AML_INPUT_DEFAULT;
-    inData.size = input_ids.size() * sizeof(float);
+    inData.size = input_data.size() * sizeof(float);

-    if (context_model.use_dma) {
-        if (context_model.malloc_buffer_once) {
-            mem_config_context_model.cache_type = AML_WITH_CACHE;
-            mem_config_context_model.memory_type = AML_VIRTUAL_ADDR;
-            mem_config_context_model.direction = AML_MEM_DIRECTION_READ_WRITE;
-            mem_config_context_model.index = 0;
-            mem_config_context_model.mem_size = inData.size;
-            aml_util_mallocBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
-            aml_util_swapExternalInputBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
-        }
-
-        inData.input_type = INPUT_DMA_DATA;
-        memcpy(mem_data_context_model.viraddr, input_ids.data(), mem_config_context_model.mem_size);
-        inData.input = NULL;
-    } else {
-        inData.input = reinterpret_cast<unsigned char*>(input_ids.data());
-        inData.input_type = BINARY_RAW_DATA;
-
-        ret = aml_module_input_set(qcontext, &inData);
-        if (ret)
-        {
-            printf("aml_module_input_set fail.\n");
-        }
+    // Use DMA
+    if (!vision_context_flag) {
+        vision_mem_config.cache_type = AML_WITH_CACHE;
+        vision_mem_config.memory_type = AML_VIRTUAL_ADDR;
+        vision_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
+        vision_mem_config.index = 0;
+        vision_mem_config.mem_size = inData.size;
+        aml_util_mallocBuffer(qcontext, &vision_mem_config, &vision_mem_data);
+        aml_util_swapExternalInputBuffer(qcontext, &vision_mem_config, &vision_mem_data);
+        vision_context_flag = qcontext;
    }
-    context_model.malloc_buffer_once = false;
+
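+    // The DMA buffer is sized on the first call; reject inputs of a different size
+    // so the memcpy below never reads past the end of input_data.
+    if (inData.size != vision_mem_config.mem_size) {
+        printf("Vision model input size changed since the first call.\n");
+        return {};
+    }
+
+    inData.input_type = INPUT_DMA_DATA;
+    memcpy(vision_mem_data.viraddr, input_data.data(), vision_mem_config.mem_size);
+    inData.input = NULL;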

    memset(&outconfig, 0, sizeof(aml_output_config_t));
-
-    if (context_model.use_dma) {
-        outconfig.format = AML_OUTDATA_DMA;
-    } else {
-        outconfig.format = AML_OUTDATA_RAW;
-    }
+    outconfig.format = AML_OUTDATA_DMA;
    outconfig.typeSize = sizeof(aml_output_config_t);
    outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);

-    return reinterpret_cast<float*>(outdata->out[0].buf);
-}
-
-int extract_index(const std::string& filename) {
-    std::regex pattern(R"(test_\w+_(\d+)\.jpg)");
-    std::smatch match;
-    if (std::regex_match(filename, match, pattern)) {
-        return std::stoi(match[1]);
+    if (outdata == NULL || outdata->out[0].buf == NULL) {
+        printf("Vision model inference failed.\n");
+        return {};
    }
-    return -1;
+
+    // Copy output to vector
+    size_t output_size = outdata->out[0].size / sizeof(float);
+    float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
+    std::vector<float> result(output_ptr, output_ptr + output_size);
+
+    return result;
 }

-std::vector<std::string> process_image_dir(
-    void* context_model,
-    const std::string& image_dir_path,
-    const std::string& base_dir,
-    const std::string& json_filename)
+std::vector<float> run_text_model(void* qcontext, const std::vector<int64_t>& input_ids)
 {
-    std::vector<std::string> results;
-    std::regex file_pattern(R"(test_(\w+)_\d+\.jpg)");
-    
-    // Get base_dir from parameter, environment variable, or use default
-    std::string actual_base_dir = base_dir;
-    if (actual_base_dir.empty()) {
-        const char* env_base_dir = std::getenv("CLIP_BASE_DIR");
-        if (env_base_dir != nullptr) {
-            actual_base_dir = env_base_dir;
-        } else {
-            actual_base_dir = "./demo_data/clip_datasets/";
-        }
-    }
-    
-    // Ensure base_dir ends with '/'
-    if (!actual_base_dir.empty() && actual_base_dir.back() != '/') {
-        actual_base_dir += "/";
-    }
-    
-    // Get json_filename from parameter, environment variable, or use default
-    std::string actual_json_filename = json_filename;
-    if (actual_json_filename.empty()) {
-        const char* env_json_filename = std::getenv("CLIP_JSON_FILENAME");
-        if (env_json_filename != nullptr) {
-            actual_json_filename = env_json_filename;
-        } else {
-            actual_json_filename = "clip_text_res.json";
-        }
+    int ret = 0;
+    nn_input inData;
+    nn_output *outdata = NULL;
+    aml_output_config_t outconfig;
+
+    inData.input_index = 0;
+    inData.info.input_format = AML_INPUT_DEFAULT;
+    inData.size = input_ids.size() * sizeof(int64_t);
+
+    // Use DMA
+    if (!text_context_flag) {
+        text_mem_config.cache_type = AML_WITH_CACHE;
+        text_mem_config.memory_type = AML_VIRTUAL_ADDR;
+        text_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
+        text_mem_config.index = 0;
+        text_mem_config.mem_size = inData.size;
+        aml_util_mallocBuffer(qcontext, &text_mem_config, &text_mem_data);
+        aml_util_swapExternalInputBuffer(qcontext, &text_mem_config, &text_mem_data);
+        text_context_flag = qcontext;
    }

-    // storing qualified paths
-    std::vector<fs::directory_entry> matched_files;
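+    // The DMA buffer is sized on the first call; reject inputs of a different size
+    // so the memcpy below never reads past the end of input_ids.
+    if (inData.size != text_mem_config.mem_size) {
+        printf("Text model input size changed since the first call.\n");
+        return {};
+    }
+
+    inData.input_type = INPUT_DMA_DATA;
+    memcpy(text_mem_data.viraddr, input_ids.data(), text_mem_config.mem_size);
+    inData.input = NULL;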

-    // collect all relevant img.
-    for (const auto& entry : fs::directory_iterator(image_dir_path)) {
-        if (!entry.is_regular_file()) continue;
+    memset(&outconfig, 0, sizeof(aml_output_config_t));
+    outconfig.format = AML_OUTDATA_DMA;
+    outconfig.typeSize = sizeof(aml_output_config_t);
+    outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);

-        std::string filename = entry.path().filename().string();
-        if (std::regex_match(filename, file_pattern)) {
-            matched_files.push_back(entry);
-        }
+    if (outdata == NULL || outdata->out[0].buf == NULL) {
+        printf("Text model inference failed.\n");
+        return {};
    }

-    // use index sort, test_type_index.jpg
-    std::sort(matched_files.begin(), matched_files.end(),
-        [](const fs::directory_entry& a, const fs::directory_entry& b) {
-            return extract_index(a.path().filename().string()) <
-                   extract_index(b.path().filename().string());
-        });
+    // Copy output to vector
+    size_t output_size = outdata->out[0].size / sizeof(float);
+    float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
+    std::vector<float> result(output_ptr, output_ptr + output_size);

-    for (const auto& entry : matched_files) {
-        if (!entry.is_regular_file()) continue;
-
-        std::string filename = entry.path().filename().string();
-        std::smatch match;
-        if (!std::regex_match(filename, match, file_pattern)) continue;
-
-        std::string name = match[1];
-
-        std::vector<float> input_data = preprocess_image(entry.path().string());
-        float* model_output = run_network(context_model, input_data, name);
-
-        float max_sim = -std::numeric_limits<float>::infinity();
-        std::string best_key, best_id;
-
-        // Iterate through all directories to find the directory containing the name
-        for (const auto& dir_entry : fs::directory_iterator(actual_base_dir)) {
-            if (!dir_entry.is_directory()) continue;
-
-            std::string folder_name = dir_entry.path().filename().string();
-            if (folder_name.find(name) == std::string::npos) continue;
-
-            std::string vit_res_path = actual_base_dir + folder_name + "/" + actual_json_filename;
-            std::ifstream vit_in(vit_res_path);
-            if (!vit_in.is_open()) {
-                printf("unopen: %s\n", vit_res_path.c_str());
-                continue;
-            }
-
-            json vit_json;
-            vit_in >> vit_json;
-
-            for (auto it = vit_json.begin(); it != vit_json.end(); ++it) {
-                const std::string& key = it.key();
-                const std::vector<float> vec = it.value().get<std::vector<float>>();
-                float sim = post_process(model_output, vec);
-                // printf("sim: %.4f\n", sim);
-                if (sim > max_sim) {
-                    max_sim = sim;
-                    best_key = key;
-                    best_id = folder_name;
-                }
-            }
-        }
-
-        if (!best_key.empty() && !best_id.empty()) {
-            std::string best_path = actual_base_dir + best_id + "/";
-            results.push_back(best_path);
-            printf("\nProcessing images: %s, datasets img path: %s\n", filename.c_str(), best_path.c_str());
-            // printf("Most similar image: %s similarity: %.4f\n", best_path.c_str(), max_sim);    // for debug
-        }
-    }
-
-    return results;
+    return result;
 }

-
 int destroy_network(void *qcontext)
 {
    int ret = 0;

-    /* free model 
-       model.use_dma = true
-       model.malloc_buffer_once = false
-    */
-    if (context_model.use_dma && mem_config_context_model.mem_size != 0) {
-        ret = aml_util_freeBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
-        if (ret)
-        {
-            std::cout << "aml_util_freeBuffer fail." << std::endl;
-        }
+    // Free the DMA input buffer only if one was allocated for this context.
+    // A context that never ran inference has no buffer, but the module below
+    // must still be destroyed, so this is not treated as an error.
+    if (vision_context_flag == qcontext) {
+        printf("Free vision model memory.\n");
+        aml_util_freeBuffer(qcontext, &vision_mem_config, &vision_mem_data);
+        vision_context_flag = nullptr;
+    } else if (text_context_flag == qcontext) {
+        printf("Free text model memory.\n");
+        aml_util_freeBuffer(qcontext, &text_mem_config, &text_mem_data);
+        text_context_flag = nullptr;
    }
-    context_model.use_dma = false;

    ret = aml_module_destroy(qcontext);
    if (ret)
    {
-        printf("aml_module_destroy fail.\n");
+        printf("Free network failed: destroy failed.\n");
        return -1;
    }

    return ret;
-}
+}
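
The file above tracks one DMA input buffer per model through two sets of file-scope globals and a pair of `*_context_flag` pointers. As an illustration only, the same bookkeeping could be keyed by context pointer in a small registry. The sketch below uses a plain byte vector as a stand-in for the SDK buffer, so `InputBuffer` and `BufferRegistry` are hypothetical names and nothing here calls the real `aml_util_*` functions.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

// Stand-in for an SDK-managed DMA buffer (illustrative only).
struct InputBuffer {
    std::vector<unsigned char> bytes;
};

class BufferRegistry {
public:
    // Copy `size` bytes into the buffer owned by `ctx`, allocating it on first use.
    // Returns false if a later call passes a different size than the first one.
    bool write(void* ctx, const void* src, size_t size)
    {
        auto it = buffers_.find(ctx);
        if (it == buffers_.end()) {
            it = buffers_.emplace(ctx, InputBuffer{std::vector<unsigned char>(size)}).first;
        }
        if (it->second.bytes.size() != size) return false;
        std::memcpy(it->second.bytes.data(), src, size);
        return true;
    }

    // Release the buffer for `ctx`; returns false if none was ever allocated.
    bool release(void* ctx) { return buffers_.erase(ctx) > 0; }

    size_t size() const { return buffers_.size(); }

private:
    std::map<void*, InputBuffer> buffers_;
};
```

One property of this layout is that releasing a context that never ran inference is a harmless no-op, so the caller can still go on to destroy the module.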
--- a/examples/clip/cpp/src/pre_postprocess.cpp
+++ b/examples/clip/cpp/src/pre_postprocess.cpp
@ -19,13 +19,13 @@
 #include <algorithm>
 #include <string>
 #include <iostream>
-#include "model_invoke.h"
+#include "clip_process.h"

 #define STB_IMAGE_IMPLEMENTATION
 #include "stb_image.h"

 // bilinear interpolation scaling
-std::vector<float> resize_bilinear(
+static std::vector<float> resize_bilinear(
    const unsigned char* src, int src_w, int src_h, int channels,
    int dst_w, int dst_h)
 {
@ -102,29 +102,29 @@ std::vector<float> preprocess_image(const std::string& image_path) {
        }
    }

-    // get NHWC
+    // Return HWC data for a single image; with batch size 1 the flat buffer already matches the NHWC layout
    return cropped;
 }

-float post_process(const float* a, const std::vector<float>& b) {
-    float dot = 0.0f, scale = 100.00000762939453f;
-    for (size_t i = 0; i < b.size(); ++i) {
-        dot += a[i] * b[i];
+// ==================== Post Processing ====================
+
+std::vector<float> l2_normalize(const std::vector<float>& vec)
+{
+    float norm = 0.0f;
+    for (float v : vec) {
+        norm += v * v;
    }
-    dot *= scale;
-    return dot;
+    norm = std::sqrt(norm) + 1e-12f;
+
+    std::vector<float> result(vec.size());
+    for (size_t i = 0; i < vec.size(); ++i) {
+        result[i] = vec[i] / norm;
+    }
+    return result;
 }

-float post_process(const int8_t* a, const std::vector<float>& b) {
-    float dot = 0.0f, scale = 100.00000762939453f;
-    for (size_t i = 0; i < b.size(); ++i) {
-        dot += (a[i] - 66) * b[i];
-    }
-    dot *= scale;
-    return dot;
-}
-
-std::vector<float> softmax(const std::vector<float>& logits) {
+std::vector<float> softmax(const std::vector<float>& logits)
+{
    std::vector<float> result(logits.size());

    // numerical stability: subtract the maximum value first.
@ -142,3 +142,17 @@ std::vector<float> softmax(const std::vector<float>& logits) {

    return result;
 }
+
+float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale)
+{
+    if (a.size() != b.size()) {
+        printf("Feature dimension mismatch: %zu vs %zu\n", a.size(), b.size());
+        return 0.0f;
+    }
+
+    float dot = 0.0f;
+    for (size_t i = 0; i < a.size(); ++i) {
+        dot += a[i] * b[i];
+    }
+    return dot * scale;
+}
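
The post-processing path in this commit (L2-normalize both embeddings, take the dot product, multiply by `logit_scale`, then apply softmax) can be exercised on the host without an NPU. The sketch below is a self-contained re-implementation for illustration. `unit_vector`, `softmax_sketch`, and `match_probs` are hypothetical names, and the softmax is assumed to be the usual max-subtracted form, since its body is elided in the hunk above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Scale a vector to unit L2 length (epsilon guards against division by zero).
static std::vector<float> unit_vector(const std::vector<float>& v)
{
    float sum_sq = 0.0f;
    for (float x : v) sum_sq += x * x;
    const float norm = std::sqrt(sum_sq) + 1e-12f;
    std::vector<float> out(v.size());
    for (size_t i = 0; i < v.size(); ++i) out[i] = v[i] / norm;
    return out;
}

// Max-subtracted softmax for numerical stability.
static std::vector<float> softmax_sketch(const std::vector<float>& logits)
{
    std::vector<float> out(logits.size());
    if (logits.empty()) return out;
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - max_logit);
        sum += out[i];
    }
    for (float& p : out) p /= sum;
    return out;
}

// Probability of each text for one image: softmax(logit_scale * cosine similarity).
static std::vector<float> match_probs(const std::vector<float>& image,
                                      const std::vector<std::vector<float>>& texts,
                                      float logit_scale)
{
    const std::vector<float> img = unit_vector(image);
    std::vector<float> logits;
    for (const auto& t : texts) {
        const std::vector<float> txt = unit_vector(t);
        float dot = 0.0f;
        for (size_t i = 0; i < img.size() && i < txt.size(); ++i) dot += img[i] * txt[i];
        logits.push_back(dot * logit_scale);
    }
    return softmax_sketch(logits);
}
```

Because cosine similarity lies in [-1, 1], a scale of 100 makes the softmax very sharp: a text that is only slightly closer to the image than the others receives most of the probability mass.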
--- a/examples/clip/py/clip.py
+++ b/examples/clip/py/clip.py
@ -1,304 +1,339 @@
-import numpy as np
-import os
-import argparse
-import json
-import re
-from PIL import Image
-from amlnnlite.api import AMLNNLite
-
-
-def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
-    """
-    Preprocess image for CLIP model.
-    
-    Steps:
-        1. Load image and convert to RGB
-        2. Scale the shorter side to target_size
-        3. Center crop to target_size x target_size
-        4. Normalize with CLIP mean and std
-    
-    Args:
-        image_path (str): Path to input image
-        target_size (int): Target image size (default: 224)
-    
-    Returns:
-        np.ndarray: Preprocessed image data with shape (target_size, target_size, 3)
-    """
-    # Load image
-    img = Image.open(image_path).convert("RGB")
-    width, height = img.size
-    
-    # Scale the shorter side
-    scale = target_size / min(width, height)
-    new_w = int(round(width * scale))
-    new_h = int(round(height * scale))
-    
-    # Resize
-    img = img.resize((new_w, new_h), Image.BILINEAR)
-    
-    # Center crop
-    left = (new_w - target_size) // 2
-    top = (new_h - target_size) // 2
-    img = img.crop((left, top, left + target_size, top + target_size))
-    
-    # Convert to numpy array and normalize to [0, 1]
-    img_array = np.array(img, dtype=np.float32) / 255.0
-    
-    # CLIP normalization
-    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
-    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
-    
-    # Normalize: (x - mean) / std
-    img_array = (img_array - mean) / std
-    
-    # Return in NHWC format
-    return img_array
-
-
-def post_process(
-    image_features: np.ndarray,
-    text_features: np.ndarray,
-    scale: float = 100.00000762939453,
-    use_cosine: bool = True,
-    apply_scale: bool = True,
-) -> float:
-    """
-    Calculate similarity between image and text features.
-    
-    Args:
-        image_features (np.ndarray): Image feature vector
-        text_features (np.ndarray): Text feature vector
-        scale (float): Scale factor for similarity calculation
-        use_cosine (bool): If True, L2-normalize both vectors before dot product (cosine similarity)
-        apply_scale (bool): If True, multiply by scale after dot product
-    
-    Returns:
-        float: Similarity score
-    """
-    img_vec = image_features.flatten().astype(np.float32)
-    txt_vec = np.array(text_features, dtype=np.float32).flatten()
-    
-    if len(img_vec) != len(txt_vec):
-        raise ValueError(f"Feature dimension mismatch: image={len(img_vec)}, text={len(txt_vec)}")
-    
-    if use_cosine:
-        img_norm = np.linalg.norm(img_vec) + 1e-8
-        txt_norm = np.linalg.norm(txt_vec) + 1e-8
-        img_vec = img_vec / img_norm
-        txt_vec = txt_vec / txt_norm
-    
-    dot_product = np.dot(img_vec, txt_vec)
-    
-    similarity = dot_product * scale if apply_scale else dot_product
-    
-    return float(similarity)
-
-
-def extract_index(filename: str) -> int:
-    """
-    Extract index from filename pattern: test_xxx_index.jpg
-    
-    Args:
-        filename (str): Filename to extract index from
-    
-    Returns:
-        int: Extracted index, or -1 if pattern doesn't match
-    """
-    pattern = r"test_\w+_(\d+)\.jpg"
-    match = re.match(pattern, filename)
-    if match:
-        return int(match.group(1))
-    return -1
-
-
-def process_image_dir(
-    amlnn: AMLNNLite,
-    image_dir_path: str,
-    base_dir: str = "",
-    json_filename: str = ""
-) -> list:
-    """
-    Process image directory and find best matching text dataset.
-    
-    Args:
-        amlnn: AMLNNLite instance
-        image_dir_path (str): Path to directory containing test images
-        base_dir (str): Base directory for clip datasets (optional, can use CLIP_BASE_DIR env var)
-        json_filename (str): JSON filename in each dataset folder (optional, can use CLIP_JSON_FILENAME env var)
-    
-    Returns:
-        list: List of best matching dataset paths
-    """
-    results = []
-    file_pattern = re.compile(r"test_(\w+)_\d+\.jpg")
-    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.JPG', '.JPEG', '.PNG', '.BMP'}
-    
-    if not base_dir:
-        base_dir = os.getenv("CLIP_BASE_DIR", "./clip_datasets/")
-    
-    if not json_filename:
-        json_filename = os.getenv("CLIP_JSON_FILENAME", "clip_text_res.json")
-    
-    matched_files = []
-    if os.path.isdir(image_dir_path):
-        for filename in os.listdir(image_dir_path):
-            filepath = os.path.join(image_dir_path, filename)
-            if os.path.isfile(filepath):
-                if file_pattern.match(filename):
-                    matched_files.append((filename, filepath, True))  
-                elif any(filename.lower().endswith(ext) for ext in image_extensions):
-                    matched_files.append((filename, filepath, False))  
-    elif os.path.isfile(image_dir_path):
-        filename = os.path.basename(image_dir_path)
-        if any(filename.lower().endswith(ext) for ext in image_extensions):
-            has_pattern = bool(file_pattern.match(filename))
-            matched_files.append((filename, image_dir_path, has_pattern))
-        else:
-            print(f"Error: {image_dir_path} is not a valid image file")
-            return results
-    else:
-        print(f"Error: {image_dir_path} is not a valid directory or file")
-        return results
-    
-    if not matched_files:
-        print(f"Warning: No image files found in {image_dir_path}")
-        return results
-    
-    print(f"Found {len(matched_files)} image file(s) to process")
-    
-    matched_files.sort(key=lambda x: extract_index(x[0]) if x[2] else 999999)
-    
-    # Process each image
-    for filename, filepath, has_pattern in matched_files:
-        if has_pattern:
-            match = file_pattern.match(filename)
-            if match:
-                name = match.group(1)
-            else:
-                name = ""  
-        else:
-            name = ""
-        
-        # Preprocess image
-        try:
-            input_data = preprocess_image(filepath)
-            input_data = np.expand_dims(input_data, axis=0)
-        except Exception as e:
-            print(f"Error preprocessing image {filename}: {e}")
-            continue
-        
-        # Run inference
-        try:
-            outputs = amlnn.inference(inputs=[input_data])
-            model_output = outputs[0]  
-            if isinstance(model_output, np.ndarray):
-                model_output = model_output.astype(np.float32)
-            else:
-                model_output = np.array(model_output, dtype=np.float32)
-            model_output = model_output.flatten()
-        except Exception as e:
-            print(f"Error running inference on {filename}: {e}")
-            continue
-        
-        max_sim = float('-inf')
-        best_key = ""
-        best_id = ""
-        
-        if not os.path.isdir(base_dir):
-            print(f"Error: Base directory does not exist: {base_dir}")
-            continue
-        
-        print(f"Searching in base directory: {base_dir}")
-        folder_count = 0
-        for folder_name in os.listdir(base_dir):
-            folder_path = os.path.join(base_dir, folder_name)
-            if not os.path.isdir(folder_path):
-                continue
-            
-            if has_pattern and name and name not in folder_name:
-                continue
-            
-            folder_count += 1
-            
-            vit_res_path = os.path.join(folder_path, json_filename)
-            if not os.path.isfile(vit_res_path):
-                print(f"Warning: JSON file not found: {vit_res_path}")
-                continue
-            
-            try:
-                with open(vit_res_path, 'r', encoding='utf-8') as f:
-                    vit_json = json.load(f)
-                
-                    for key, text_vec in vit_json.items():
-                        if isinstance(text_vec, list):
-                            text_features = np.array(text_vec, dtype=np.float32)
-                            sim_scaled = post_process(
-                                model_output,
-                                text_features,
-                                use_cosine=True,
-                                apply_scale=True,
-                            )
-                            
-                            if sim_scaled > max_sim:
-                                max_sim = sim_scaled
-                                best_key = key
-                                best_id = folder_name
-            except Exception as e:
-                print(f"Error loading JSON file {vit_res_path}: {e}")
-                continue
-        
-        if best_key and best_id:
-            best_path = os.path.join(base_dir, best_id)
-            results.append(best_path)
-            print(f"\nProcessing image: {filename}")
-            print(f"  Best matching dataset: {best_path}")
-        else:
-            print(f"\nProcessing image: {filename}")
-            print(f"  No matching dataset found (searched {folder_count} folder(s))")
-    
-    return results
-
-
-def main():
-    parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo')
-    parser.add_argument('--model-path', required=True, help='Path to the CLIP model file')
-    parser.add_argument('--base-dir', default='./clip_datasets/', help='Base directory for clip datasets (can also use CLIP_BASE_DIR env var)')
-    parser.add_argument('--json-filename', default='clip_text_res.json', help='JSON filename in each dataset folder (can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)')
-    parser.add_argument('--image-dir', default='./', help='Image directory or single image file to process (optional, will prompt if not provided)')
-    args = parser.parse_args()
-    
-    # Initialize AMLNNLite
-    print("Initializing model...")
-    amlnn = AMLNNLite()
-    amlnn.config(model_path=args.model_path)
-    amlnn.init()
-    print("Model initialized successfully.\n")
-    
-    # Process images
-    if args.image_dir:
-        results = process_image_dir(amlnn, args.image_dir, args.base_dir, args.json_filename)
-        print(f"\nTotal results: {len(results)}")
-        for i, result in enumerate(results):
-            print(f"Index[{i}]: {result}")
-    else:
-        while True:
-            image_path = input("\nPlease enter the JPG image path or directory (enter 'exit' to quit):\n").strip()
-            
-            if image_path.lower() == 'exit':
-                break
-            
-            if not image_path:
-                print("The path cannot be empty.")
-                continue
-            
-            results = process_image_dir(amlnn, image_path, args.base_dir, args.json_filename)
-            
-            for i, result in enumerate(results):
-                print(f"Index[{i}]: {result}")
-    
-    amlnn.uninit()
-    print("\nDone.")
-
-
-if __name__ == "__main__":
-    main()
+# -*- coding: utf-8 -*-
+"""
+Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+
+# This script runs CLIP image-text matching inference with AMLNNLite.
+
+import os
+import argparse
+import numpy as np
+from PIL import Image
+from transformers import CLIPTokenizer
+from amlnnlite.api import AMLNNLite
+
+# ==================== Utility Functions ====================
+
+def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
+    """Compute softmax values for array x."""
+    x = x - np.max(x, axis=axis, keepdims=True)
+    e = np.exp(x)
+    return e / np.sum(e, axis=axis, keepdims=True)
+
+
+def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
+    """L2 normalize array x along specified axis."""
+    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)
+
+# ==================== Vision Preprocessing ====================
+
+def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
+    """
+    Preprocess image for CLIP model.
+
+    Args:
+        image_path (str): Path to input image
+        target_size (int): Target image size (default: 224)
+
+    Returns:
+        np.ndarray: Preprocessed image data with shape (1, target_size, target_size, 3) in NHWC format
+    """
+    image = Image.open(image_path).convert("RGB")
+    width, height = image.size
+
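+    # Scale the shorter side to target_size. Round instead of truncating so
+    # floating-point error cannot leave a side one pixel short of the crop.
+    scale = target_size / min(width, height)
+    new_width = max(target_size, round(width * scale))
+    new_height = max(target_size, round(height * scale))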
+    image_resized = image.resize((new_width, new_height), resample=Image.BICUBIC)
+
+    # Center crop
+    left = (new_width - target_size) // 2
+    top = (new_height - target_size) // 2
+    right = left + target_size
+    bottom = top + target_size
+    image_cropped = image_resized.crop((left, top, right, bottom))
+
+    # Convert to numpy array and normalize to [0, 1]
+    image_np = np.array(image_cropped).astype(np.float32) / 255.0
+
+    # CLIP normalization
+    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
+    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
+    image_np = (image_np - mean) / std
+
+    # Add batch dimension: HWC -> NHWC
+    image_np = np.expand_dims(image_np, axis=0)
+
+    return image_np.astype(np.float32)  # [1, 224, 224, 3]
+
+# ==================== Text Preprocessing ====================
+
+def preprocess_text(tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
+    """
+    Preprocess text for CLIP model using CLIPTokenizer.
+
+    Args:
+        tokenizer: CLIPTokenizer instance
+        text (str): Input text string
+        max_len (int): Maximum sequence length (default: 64)
+
+    Returns:
+        np.ndarray: Tokenized text with shape (1, max_len) as int64
+    """
+    enc = tokenizer(
+        text,
+        padding="max_length",
+        truncation=True,
+        max_length=max_len,
+        return_tensors="np",
+    )
+    # text model input: int64[1, max_len]
+    input_ids = enc["input_ids"].astype(np.int64)
+    return input_ids
+
+# ==================== Model Inference ====================
+
+def compute_image_embedding(vision_amlnn: AMLNNLite, image_path: str) -> np.ndarray:
+    """
+    Compute image embedding using vision model.
+
+    Args:
+        vision_amlnn: AMLNNLite instance for vision model
+        image_path (str): Path to input image
+    
+    Returns:
+        np.ndarray: L2-normalized image embedding with shape (1, embed_dim)
+    """
+    input_data = preprocess_image(image_path)  # [1, 224, 224, 3]
+
+    outputs = vision_amlnn.inference(
+        inputs=[input_data],
+        inputs_data_format='NHWC',
+        outputs_data_format='NHWC'
+    )
+
+    feats = outputs[0].astype(np.float32)
+    feats = feats.reshape(1, -1)  # Flatten to [1, embed_dim]
+    return l2_normalize(feats, axis=1)
+
+def compute_text_embedding(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
+    """
+    Compute text embedding using text model.
+
+    Args:
+        text_amlnn: AMLNNLite instance for text model
+        tokenizer: CLIPTokenizer instance
+        text (str): Input text string
+        max_len (int): Maximum sequence length
+
+    Returns:
+        np.ndarray: L2-normalized text embedding with shape (1, embed_dim)
+    """
+    input_ids = preprocess_text(tokenizer, text, max_len)  # [1, max_len]
+    print(f"input_ids: {input_ids}")
+
+    # AMLNNLite requires 4D input, reshape to (1, 1, 1, max_len)
+    input_ids_4d = input_ids[:, None, None, :]  # [1, 1, 1, max_len]
+
+    outputs = text_amlnn.inference(
+        inputs=[input_ids_4d],
+        inputs_data_format='NHWC',
+        outputs_data_format='NHWC'
+    )
+
+    feats = outputs[0].astype(np.float32)
+    feats = feats.reshape(1, -1)  # Flatten to [1, embed_dim]
+    return l2_normalize(feats, axis=1)
+
+def compute_text_embeddings_batch(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, texts: list, max_len: int = 64) -> np.ndarray:
+    """
+    Compute text embeddings for multiple texts.
+
+    Args:
+        text_amlnn: AMLNNLite instance for text model
+        tokenizer: CLIPTokenizer instance
+        texts (list): List of input text strings
+        max_len (int): Maximum sequence length
+
+    Returns:
+        np.ndarray: L2-normalized text embeddings with shape (num_texts, embed_dim)
+    """
+    embeddings = []
+    for text in texts:
+        emb = compute_text_embedding(text_amlnn, tokenizer, text, max_len)
+        embeddings.append(emb[0])  # Remove batch dimension
+    return np.stack(embeddings, axis=0)  # [num_texts, embed_dim]
+
+# ==================== Similarity Calculation ====================
+
+def compute_similarity(image_embedding: np.ndarray, text_embeddings: np.ndarray, logit_scale: float = 100.0) -> tuple:
+    """
+    Compute similarity between image and text embeddings.
+
+    Args:
+        image_embedding (np.ndarray): Image embedding with shape (1, embed_dim)
+        text_embeddings (np.ndarray): Text embeddings with shape (num_texts, embed_dim)
+        logit_scale (float): Scale factor for logits
+
+    Returns:
+        tuple: (similarities, logits, probabilities)
+    """
+    # Cosine similarity (embeddings are already L2-normalized)
+    sims = text_embeddings @ image_embedding[0]  # [num_texts]
+    logits = sims * logit_scale  # [num_texts]
+    probs = softmax(logits, axis=0)  # [num_texts]
+
+    return sims, logits, probs
+
+# ==================== Main Function ====================
+
+def main():
+    parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo using AMLNNLite')
+    parser.add_argument('--vision-model', required=True, help='Path to vision model (.adla)')
+    parser.add_argument('--text-model', required=True, help='Path to text model (.adla)')
+    parser.add_argument('--tokenizer-dir', required=True, help='Path to CLIPTokenizer directory')
+    parser.add_argument('--image-path', default=None, help='Path to input image (optional, will prompt if not provided)')
+    parser.add_argument('--texts', nargs='+', default=None, help='List of text descriptions to compare')
+    parser.add_argument('--max-len', type=int, default=64, help='Maximum token sequence length (default: 64)')
+    parser.add_argument('--logit-scale', type=float, default=100.0, help='Logit scale factor (default: 100.0)')
+
+    args = parser.parse_args()
+
+    # Validate model paths
+    if not os.path.exists(args.vision_model):
+        print(f"[Error] Vision model not found: {args.vision_model}")
+        return -1
+
+    if not os.path.exists(args.text_model):
+        print(f"[Error] Text model not found: {args.text_model}")
+        return -1
+
+    # Load tokenizer
+    print(f"[Info] Loading CLIPTokenizer from: {args.tokenizer_dir}")
+    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_dir)
+
+    # Initialize vision model
+    print(f"[Info] Initializing vision model: {args.vision_model}")
+    vision_amlnn = AMLNNLite()
+    vision_amlnn.config(model_path=args.vision_model, run_cycles=1)
+    vision_amlnn.init()
+
+    # Initialize text model
+    print(f"[Info] Initializing text model: {args.text_model}")
+    text_amlnn = AMLNNLite()
+    text_amlnn.config(model_path=args.text_model, run_cycles=1)
+    text_amlnn.init()
+
+    print("[Info] Models initialized successfully.\n")
+
+    try:
+        # Interactive loop
+        while True:
+            # Get image path
+            if args.image_path:
+                image_path = args.image_path
+                args.image_path = None  # Clear for next iteration
+            else:
+                print("=" * 60)
+                print("[Info] Image Path (or 'exit' to quit):")
+                image_path = input().strip()
+
+            # Check for exit
+            if image_path.lower() == 'exit':
+                print("[Info] Exiting...")
+                break
+
+            # Validate image path
+            if not image_path:
+                print("[Warning] Please enter an image path.")
+                continue
+
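+            # Require a regular file: a directory would pass os.path.exists()
+            # and only fail later inside Image.open().
+            if not os.path.isfile(image_path):
+                print(f"[Error] Image not found: {image_path}")
+                continue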
+
+            # Get texts to compare
+            if args.texts:
+                texts = args.texts
+                args.texts = None  # Clear for next iteration
+            else:
+                print("[Info] Enter text descriptions (comma-separated, or 'skip' to use defaults):")
+                text_input = input().strip()
+
+                if text_input.lower() == 'skip' or not text_input:
+                    # Default texts for demo
+                    texts = [
+                        "a red handbag",
+                        "a blue jacket",
+                        "a red bus",
+                    ]
+                    print(f"[Info] Using default texts: {texts}")
+                else:
+                    texts = [t.strip() for t in text_input.split(',') if t.strip()]
+
+            if not texts:
+                print("[Warning] No texts provided.")
+                continue
+
+            try:
+                # Compute image embedding
+                print(f"\n[Info] Processing image: {image_path}")
+                image_embedding = compute_image_embedding(vision_amlnn, image_path)
+                print(f"[Info] Image embedding shape: {image_embedding.shape}")
+
+                # Compute text embeddings
+                print(f"[Info] Processing {len(texts)} text(s)...")
+                text_embeddings = compute_text_embeddings_batch(text_amlnn, tokenizer, texts, args.max_len)
+                print(f"[Info] Text embeddings shape: {text_embeddings.shape}")
+
+                # Compute similarity
+                sims, logits, probs = compute_similarity(image_embedding, text_embeddings, args.logit_scale)
+
+                # Print results
+                print("\n" + "=" * 60)
+                print("CLIP Image-Text Matching Results")
+                print("=" * 60)
+                print(f"Image: {image_path}")
+                print(f"logit_scale: {args.logit_scale:.6f}")
+                print("-" * 60)
+
+                # Sort by probability (descending)
+                sorted_indices = np.argsort(probs)[::-1]
+                for rank, i in enumerate(sorted_indices):
+                    print(f"[{rank + 1}] prob={probs[i]:.6f}  sim={float(sims[i]):.6f}  text='{texts[i]}'")
+
+                print("=" * 60 + "\n")
+
+            except Exception as e:
+                print(f"[Error] Processing failed: {e}")
+                import traceback
+                traceback.print_exc()
+                continue
+
+    except (KeyboardInterrupt, EOFError):
+        # EOFError covers non-interactive runs where stdin is closed or piped.
+        print("\n\n[Info] Interrupted or end of input. Exiting...")
+
+    finally:
+        # Cleanup
+        vision_amlnn.uninit()
+        text_amlnn.uninit()
+
+    print("[Info] Done.")
+    return 0
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(main())
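
Review note: the scoring step in `compute_similarity` can be checked without a device or the `.adla` models. The sketch below is a pure-Python stand-in for illustration only, not the script's code path: `rank_texts` and the 4-dimensional vectors are hypothetical, and the math mirrors the NumPy version (L2-normalize, dot product, multiply by the logit scale, softmax).

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_texts(image_vec, text_vecs, logit_scale=100.0):
    """Return cosine similarities, probabilities, and indices sorted best-first."""
    image_unit = l2_normalize(image_vec)
    sims = [
        sum(a * b for a, b in zip(l2_normalize(text_vec), image_unit))
        for text_vec in text_vecs
    ]
    probs = softmax([s * logit_scale for s in sims])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sims, probs, order


# Hypothetical embeddings standing in for the vision and text model outputs.
image_vec = [1.0, 2.0, 0.0, 1.0]
text_vecs = [
    [0.0, 0.0, 1.0, 0.0],  # orthogonal to the image vector
    [2.0, 4.0, 0.0, 2.0],  # same direction, different magnitude
    [1.0, 0.0, 0.0, 0.0],  # partially aligned
]

sims, probs, order = rank_texts(image_vec, text_vecs)
for rank, i in enumerate(order, start=1):
    print(f"[{rank}] prob={probs[i]:.6f}  sim={sims[i]:.6f}  text_index={i}")
```

With the default scale of 100, small gaps in cosine similarity turn into near one-hot probabilities, so the printed `sim` column is usually the more informative one when comparing close candidates.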
--- a/examples/clip/tokenizer_path/merges.txt
+++ b/examples/clip/tokenizer_path/merges.txt
--- a/examples/clip/tokenizer_path/vocab.json
+++ b/examples/clip/tokenizer_path/vocab.json