feat:update demo code of CLIP

2026-02-12 11:19:52 +08:00 · 2026-02-12 11:19:52 +08:00 · 5478a8618b
commit 5478a8618b
parent 4bf4aafc73
12 changed files with 50385 additions and 694 deletions
--- a/examples/clip/000000004505.jpg
+++ b/examples/clip/000000004505.jpg
--- a/examples/clip/README.md
+++ b/examples/clip/README.md
@ -1,95 +1,159 @@
-## Demo Run
+# CLIP
-
+
-### CPP
+## 1. Overview
-
+
-#### 1. Compile
+This demo shows how to run CLIP (Contrastive Language-Image Pre-Training) image-text matching with AMLNNLite. The CLIP model has two parts, a vision encoder and a text encoder, which together compute the similarity between an image and a set of text descriptions.
-
+
-**Prerequisites:**
+## 2. Model Download
- Android NDK (r25e recommended)
+
- `ANDROID_NDK_PATH` environment variable set
+TO DO
-
+
-**Build:**
+## 3. Model Conversion
-```bash
+
-# Build for arm64-v8a
+TO DO
-cd examples/clip/cpp
+
-./build-android.sh -a arm64-v8a
+## 4. Demo Run
-```
+
-
+### CPP
-The executable will be generated at `build/android_arm64-v8a/clip_demo` (Note: executable name may vary, verify in build folder).
+
-
+#### 1. Compile
-#### 2. Run
+
-
+**Prerequisites:**
-```bash
+- Android NDK (r25e recommended)
-# Push executable to device
+- `ANDROID_NDK_PATH` environment variable set
-adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
+
-adb push model/vision_model_int8_A311D2.adla /data/local/tmp/
+**Build:**
-adb push clip_datasets/ /data/local/tmp/
+```bash
-adb push test_hat_0.jpg /data/local/tmp/
+# Build for arm64-v8a
-
+cd examples/clip/cpp
-# Run on device
+./build-android.sh -a arm64-v8a
-adb shell
+```
-cd /data/local/tmp
+
-chmod +x clip_demo
+The executable will be generated at `build/android_arm64-v8a/clip_demo`.
-export LD_LIBRARY_PATH=/vendor/lib64 or (/vendor/lib)
+
-
+#### 2. Run
-# Usage: ./clip_demo <model_path> [base_dir] [json_filename]
+
-./clip_demo vision_model_int8_A311D2.adla ./clip_datasets/ clip_text_res.json
+```bash
-```
+# Push executable and resources to device
-
+adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
-**Note:** 
+adb push model/vision_model_int8_S905X5.adla /data/local/tmp/
- Replace `vision_model_int8_A311D2.adla` with your actual model file path.
+adb push model/text_model_int8_S905X5.adla /data/local/tmp/
- The `base_dir` and `json_filename` parameters are optional. You can also use environment variables `CLIP_BASE_DIR` and `CLIP_JSON_FILENAME`.
+adb push tokenizer_path/ /data/local/tmp/
- The program will prompt you to enter image paths interactively. Enter "exit" to quit.
+
-
+# Run on device
-### Python
+adb shell
-
+cd /data/local/tmp
-**Prerequisites:**
+chmod +x clip_demo
- Python 3.10
+export LD_LIBRARY_PATH=/vendor/lib64   # or /vendor/lib, depending on the device
- Required packages: `numpy`, `Pillow`, `amlnnlite`
+
-
+# Usage: ./clip_demo <vision_model> <text_model> <tokenizer_path> [--profiling]
-**Install dependencies:**
+./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path/
-```bash
+```
-pip install numpy Pillow amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
+
-```
+The program will prompt for image paths and text descriptions interactively. Enter the path to an image file, then enter comma-separated text descriptions (or `skip` to use defaults). Type `exit` to quit.
-
+
-**Run on device:**
+**Argument Descriptions:**
-```bash
+
-# Basic usage (process current directory)
+| Argument       | Description                                                  |
-python clip.py --model-path ./vision_model_int8_A311D2.adla
+| -------------- | ------------------------------------------------------------ |
-
+| vision_model   | Path to vision encoder .adla model (required)                |
-# Specify image directory or file
+| text_model     | Path to text encoder .adla model (required)                  |
-python clip.py --model-path ./vision_model_int8_A311D2.adla --image-dir ./
+| tokenizer_path  | Path to directory containing `vocab.json` and `merges.txt` (required) |
-
+| --profiling    | Enable performance profiling output (optional)               |
-# Specify base directory and JSON filename
+
-python clip.py --model-path ./vision_model_int8_A311D2.adla --base-dir ./clip_datasets/ --json-filename clip_text_res.json
+**Note:** The `tokenizer_path` should contain `vocab.json` and `merges.txt` files from the CLIP tokenizer (e.g., from `openai/clip-vit-base-patch32`).
-```
+
-
+### Python
-The script will automatically process all image files (`.jpg`, `.jpeg`, `.png`, `.bmp`) in the specified directory or process a single image file, and display the best matching dataset for each image.
+
-
+**Prerequisites:**
-5. Results
+- Python 3.10
-
+- Required packages: `numpy`, `Pillow`, `transformers`, `amlnnlite`
-The program will print the best matching dataset path for each processed image. The program searches through all dataset folders in the base directory and finds the text feature with the highest similarity to the input image.
+
-
+**Install dependencies:**
-**Example output:**
+```bash
-```
+pip install numpy Pillow transformers amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
-# python demo result
+```
-Model initialized successfully.
+
-
+**Run on device:**
-Found 2 image file(s) to process
+```bash
-Searching in base directory: ./clip_datasets/
+python clip.py \
-
+    --vision-model ./vision_model_int8_S905X5.adla \
-Processing image: test_jacket_0.jpg
+    --text-model ./text_model_int8_S905X5.adla \
-  Best matching dataset: ./clip_datasets/shirt10_jacket7
+    --tokenizer-dir ./tokenizer_path \
-Searching in base directory: ./clip_datasets/
+    --image-path ./000000004505.jpg \
-
+    --texts "a red handbag" "a blue jacket" "a red bus"
-Processing image: test_hat_0.jpg
+```
-  Best matching dataset: ./clip_datasets/hat1_jd
+
-
+**Interactive Mode (Recommended):**
-Total results: 2
+
-Index[0]: ./clip_datasets/shirt10_jacket7
+If you don't provide `--image-path`, the program will run in interactive mode:
-Index[1]: ./clip_datasets/hat1_jd
+
-
+```bash
-Done.
+python clip.py \
-```
+    --vision-model ./vision_model_int8_S905X5.adla \
-
+    --text-model ./text_model_int8_S905X5.adla \
-The program returns the dataset folder path that contains the text feature with the highest similarity to the input image. Each result represents the best matching dataset for the corresponding input image.
+    --tokenizer-dir ./tokenizer_path
 ```
 The program will prompt for image paths and text descriptions. Enter an image path to process, then enter comma-separated texts to compare. Type `exit` to quit.
 **Argument Descriptions:**
 | Argument         | Description                                                  |
 | ---------------- | ------------------------------------------------------------ |
 | --vision-model   | Path to vision encoder .adla model (required)                |
 | --text-model     | Path to text encoder .adla model (required)                  |
 | --tokenizer-dir  | Path to CLIPTokenizer directory (required)                   |
 | --image-path     | Path to input image (.jpg, .png) - optional, will prompt if not provided |
 | --texts          | List of text descriptions to compare (space-separated)       |
 | --max-len        | Maximum token sequence length, default is 64                 |
 | --logit-scale    | Logit scale factor, default is 100.0                         |
 **Note:** The `--tokenizer-dir` should point to the directory containing the CLIPTokenizer files. You can use a Hugging Face model ID (e.g., `openai/clip-vit-base-patch32`) or a local directory.
 ## 5. Results
 **Performance Feedback**
 With the `--profiling` flag (C++) or the log level set to INFO, the program prints performance metrics to the console when a run completes. The log includes essential hardware and execution details:
 - Hardware Information: System and ADLA library versions.
 - Model Overview: Basic input/output configurations.
 - NPU Metrics: Total inference time (latency) and total DRAM bandwidth consumption.
 **Interactive Mode Example:**
 ```bash
 $ ./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path
 [Info] Models initialized successfully.
 ============================================================
 [Info] Image Path (or 'exit' to quit):
 000000004505.jpg
 [Info] Enter text descriptions (comma-separated, or 'skip' for defaults):
 a red handbag, a blue jacket, a red bus
 [Info] Processing image: 000000004505.jpg
 [Info] Image embedding size: 512
 [Info] Processing 3 text(s)...
 [Info] Text embeddings size: 3 x 512
 ============================================================
 CLIP Image-Text Matching Results
 ============================================================
 Image: 000000004505.jpg
 logit_scale: 100.000000
 ------------------------------------------------------------
 [1] prob=0.999975  sim=0.327895  text='a red bus'
 [2] prob=0.000016  sim=0.217690  text='a red handbag'
 [3] prob=0.000008  sim=0.211029  text='a blue jacket'
 ============================================================
 ============================================================
 [Info] Image Path (or 'exit' to quit):
 exit
 [Info] Exiting...
 Free vision model memory.
 Free text model memory.
 [Info] Done.
 ```
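The `prob` column in the example output appears consistent with a softmax over `logit_scale * sim`. The following is a minimal sketch under that assumption; `probs_from_sims` is a hypothetical helper written for illustration and is not part of the demo sources.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the demo): numerically stable softmax over
// logit_scale * similarity. Subtracting the maximum similarity before
// exponentiating avoids overflow and does not change the result.
std::vector<double> probs_from_sims(const std::vector<double>& sims, double logit_scale) {
    std::vector<double> out(sims.size());
    if (sims.empty()) return out;
    const double max_sim = *std::max_element(sims.begin(), sims.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < sims.size(); ++i) {
        out[i] = std::exp(logit_scale * (sims[i] - max_sim));
        sum += out[i];
    }
    for (double& p : out) p /= sum;
    return out;
}
```

Calling `probs_from_sims({0.327895, 0.217690, 0.211029}, 100.0)` yields values that round to the three probabilities printed above. With a scale of 100, a similarity gap of about 0.11 is enough to push nearly all of the probability mass onto the top match.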
--- a/examples/clip/cpp/src/CMakeLists.txt
+++ b/examples/clip/cpp/src/CMakeLists.txt
@ -1,42 +1,43 @@
-cmake_minimum_required(VERSION 3.5)
+cmake_minimum_required(VERSION 3.5)
-project(clip_demo)
+project(clip_demo)
-
+
-set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD 17)
-
+
-# Set NNSDK path
+# Set NNSDK path
-set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
+set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
-include_directories(${NNSDK_ROOT}/include)
+include_directories(${NNSDK_ROOT}/include)
-include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
+include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
-
+
-# Set 3rdparty path
+# Set 3rdparty path
-set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
+set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
-
+
-# Include directories for stb_image and json
+# Include directories for stb_image and json
-# Note: code uses #include "stb_image.h" and #include "json.hpp"
+# Note: code uses #include "stb_image.h" and #include "json.hpp"
-include_directories(${3RDPARTY_DIR}/stb_image)
+include_directories(${3RDPARTY_DIR}/stb_image)
-include_directories(${3RDPARTY_DIR}/json)
+include_directories(${3RDPARTY_DIR}/json)
-
+
-if(CMAKE_SYSTEM_NAME STREQUAL "Android")
+if(CMAKE_SYSTEM_NAME STREQUAL "Android")
-    if (ANDROID_ABI STREQUAL "arm64-v8a")
+    if (ANDROID_ABI STREQUAL "arm64-v8a")
-        link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
+        link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
-    else()
+    else()
-        link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
+        link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
-    endif()
+    endif()
-    # Android needs log
+    # Android needs log
-    link_libraries(log)
+    link_libraries(log)
-elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
+elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
-    link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
+    link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
-endif()
+endif()
-
+
-add_executable(${PROJECT_NAME}
+add_executable(${PROJECT_NAME}
-    main.cpp
+    main.cpp
-    model_invoke.cpp
+    model_invoke.cpp
-    pre_postprocess.cpp
+    pre_postprocess.cpp
-)
+    clip_tokenizer.cpp
-
+)
-target_link_libraries(${PROJECT_NAME}
+
-    nnsdk
+target_link_libraries(${PROJECT_NAME}
-    dl
+    nnsdk
-    m
+    dl
-)
+    m
-
+)
--- a/examples/clip/cpp/src/clip_process.h
+++ b/examples/clip/cpp/src/clip_process.h
@ -0,0 +1,53 @@
 /*
 * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 #ifndef CLIP_PROCESS_H
 #define CLIP_PROCESS_H
 #include <string>
 #include <vector>
 #include <cstdint>
 // ==================== Model Invoke ====================
 // Initialize network from file
 void* init_network_file(const char *model_path);
 // Run vision model inference
 std::vector<float> run_vision_model(void* context, const std::vector<float>& input_data);
 // Run text model inference
 std::vector<float> run_text_model(void* context, const std::vector<int64_t>& input_ids);
 // Destroy network
 int destroy_network(void *qcontext);
 // ==================== Pre/Post Processing ====================
 // Image preprocessing
 std::vector<float> preprocess_image(const std::string& image_path);
 // L2 normalize
 std::vector<float> l2_normalize(const std::vector<float>& vec);
 // Softmax
 std::vector<float> softmax(const std::vector<float>& logits);
 // Compute cosine similarity
 float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale = 100.0f);
 #endif // CLIP_PROCESS_H
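The bodies of `l2_normalize` and `compute_similarity` live in `pre_postprocess.cpp`, which is not shown in this part of the diff. As a rough illustration of what declarations like these usually compute, here is a sketch under stated assumptions: the `_sketch` names are hypothetical, and whether the real `compute_similarity` applies `scale` internally is not visible here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: scale a vector to unit L2 length.
// A (near-)zero vector is returned as all zeros to avoid dividing by zero.
std::vector<float> l2_normalize_sketch(const std::vector<float>& vec) {
    double sum_sq = 0.0;
    for (float v : vec) sum_sq += static_cast<double>(v) * v;
    const double norm = std::sqrt(sum_sq);
    std::vector<float> out(vec.size(), 0.0f);
    if (norm < 1e-12) return out;
    for (std::size_t i = 0; i < vec.size(); ++i) {
        out[i] = static_cast<float>(vec[i] / norm);
    }
    return out;
}

// Hypothetical sketch: cosine similarity of a and b, multiplied by scale.
// Returns 0 for empty or mismatched inputs (an assumption of this sketch).
float scaled_cosine_sketch(const std::vector<float>& a,
                           const std::vector<float>& b,
                           float scale = 100.0f) {
    if (a.empty() || a.size() != b.size()) return 0.0f;
    const std::vector<float> na = l2_normalize_sketch(a);
    const std::vector<float> nb = l2_normalize_sketch(b);
    double dot = 0.0;
    for (std::size_t i = 0; i < na.size(); ++i) {
        dot += static_cast<double>(na[i]) * nb[i];
    }
    return static_cast<float>(scale * dot);
}
```

Normalizing both embeddings first means the dot product is the cosine of the angle between them, so vector length does not affect the score.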
--- a/examples/clip/cpp/src/clip_tokenizer.cpp
+++ b/examples/clip/cpp/src/clip_tokenizer.cpp
@ -0,0 +1,395 @@
 /*
 * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 #include "clip_tokenizer.h"
 #include "json.hpp"
 #include <fstream>
 #include <sstream>
 #include <iostream>
 #include <algorithm>
 #include <regex>
 #include <set>
 #include <cassert>
 #include <climits>
 #include <codecvt>
 #include <locale>
 using json = nlohmann::ordered_json;
 // Reference: https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
 void CLIPTokenizer::init_byte_to_unicode()
 {
    byte_to_unicode_.clear();
    unicode_to_byte_.clear();
    // Printable ASCII ranges that map to themselves
    // '!' (33) to '~' (126), '¡' (161) to '¬' (172), '®' (174) to 'ÿ' (255)
    std::vector<int> bs;
    for (int i = 33; i <= 126; ++i) bs.push_back(i);    // '!' to '~'
    for (int i = 161; i <= 172; ++i) bs.push_back(i);   // '¡' to '¬'
    for (int i = 174; i <= 255; ++i) bs.push_back(i);   // '®' to 'ÿ'
    std::vector<int> cs(bs.begin(), bs.end());
    // Map remaining bytes (0-32, 127-160, 173) to 256+
    int n = 0;
    for (int b = 0; b < 256; ++b) {
        if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
            bs.push_back(b);
            cs.push_back(256 + n);
            n++;
        }
    }
    for (size_t i = 0; i < bs.size(); ++i) {
        byte_to_unicode_[static_cast<uint8_t>(bs[i])] = static_cast<char32_t>(cs[i]);
        unicode_to_byte_[static_cast<char32_t>(cs[i])] = static_cast<uint8_t>(bs[i]);
    }
 }
 // ========== UTF-8 Helpers ==========
 std::vector<char32_t> CLIPTokenizer::utf8_to_codepoints(const std::string& str)
 {
    std::vector<char32_t> result;
    size_t i = 0;
    while (i < str.size()) {
        char32_t cp = 0;
        unsigned char c = str[i];
        int len = 0;
        if (c < 0x80) {
            cp = c;
            len = 1;
        } else if ((c & 0xE0) == 0xC0) {
            cp = c & 0x1F;
            len = 2;
        } else if ((c & 0xF0) == 0xE0) {
            cp = c & 0x0F;
            len = 3;
        } else if ((c & 0xF8) == 0xF0) {
            cp = c & 0x07;
            len = 4;
        } else {
            ++i;
            continue;
        }
        for (int j = 1; j < len && (i + j) < str.size(); ++j) {
            cp = (cp << 6) | (str[i + j] & 0x3F);
        }
        result.push_back(cp);
        i += len;
    }
    return result;
 }
 std::string CLIPTokenizer::codepoints_to_utf8(const std::vector<char32_t>& cps)
 {
    std::string result;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return result;
 }
 // ========== Load Functions ==========
 bool CLIPTokenizer::load(const std::string& vocab_path, const std::string& merges_path)
 {
    init_byte_to_unicode();
    // Load vocab.json
    {
        std::ifstream file(vocab_path);
        if (!file.is_open()) {
            std::cerr << "Failed to open vocab file: " << vocab_path << std::endl;
            return false;
        }
        try {
            json j;
            file >> j;
            for (auto it = j.begin(); it != j.end(); ++it) {
                std::string token = it.key();
                int id = it.value().get<int>();
                token_to_id_[token] = id;
                id_to_token_[id] = token;
            }
        } catch (const std::exception& e) {
            std::cerr << "Error parsing vocab.json: " << e.what() << std::endl;
            return false;
        }
    }
    // Find special token IDs
    if (token_to_id_.count("<|startoftext|>")) {
        sot_token_id_ = token_to_id_["<|startoftext|>"];
    }
    if (token_to_id_.count("<|endoftext|>")) {
        eot_token_id_ = token_to_id_["<|endoftext|>"];
    }
    // Load merges.txt
    {
        std::ifstream file(merges_path);
        if (!file.is_open()) {
            std::cerr << "Failed to open merges file: " << merges_path << std::endl;
            return false;
        }
        std::string line;
        int rank = 0;
        // Skip header line "#version: ..." if present
        if (std::getline(file, line)) {
            if (line.find("#version") == std::string::npos) {
                // First line is not a header, process it
                std::istringstream iss(line);
                std::string a, b;
                if (iss >> a >> b) {
                    bpe_ranks_[{a, b}] = rank++;
                }
            }
        }
        while (std::getline(file, line)) {
            if (line.empty()) continue;
            std::istringstream iss(line);
            std::string a, b;
            if (iss >> a >> b) {
                bpe_ranks_[{a, b}] = rank++;
            }
        }
    }
    loaded_ = true;
    printf("[Info] CLIPTokenizer loaded: vocab_size=%zu, merges=%zu\n",
           token_to_id_.size(), bpe_ranks_.size());
    return true;
 }
 bool CLIPTokenizer::load_from_dir(const std::string& tokenizer_dir)
 {
    std::string dir = tokenizer_dir;
    // Ensure trailing slash
    if (!dir.empty() && dir.back() != '/' && dir.back() != '\\') {
        dir += "/";
    }
    return load(dir + "vocab.json", dir + "merges.txt");
 }
 // ========== BPE Implementation ==========
 std::string CLIPTokenizer::bytes_to_unicode_str(const std::string& raw) const
 {
    std::vector<char32_t> result;
    for (unsigned char c : raw) {
        auto it = byte_to_unicode_.find(c);
        if (it != byte_to_unicode_.end()) {
            result.push_back(it->second);
        }
    }
    return codepoints_to_utf8(result);
 }
 std::vector<std::string> CLIPTokenizer::bpe(const std::string& token) const
 {
    // Convert token to individual unicode characters as strings
    auto codepoints = utf8_to_codepoints(token);
    if (codepoints.empty()) return {};
    // Each character becomes a separate piece
    std::vector<std::string> word;
    for (size_t i = 0; i < codepoints.size(); ++i) {
        std::string piece = codepoints_to_utf8({codepoints[i]});
        // CLIP adds </w> to the last character
        if (i == codepoints.size() - 1) {
            piece += "</w>";
        }
        word.push_back(piece);
    }
    if (word.size() == 1) return word;
    // Iteratively merge the adjacent pair with the lowest rank (highest priority)
    while (true) {
        if (word.size() < 2) break;
        // Find the pair with the lowest rank
        int best_rank = INT_MAX;
        int best_idx = -1;
        for (size_t i = 0; i < word.size() - 1; ++i) {
            auto it = bpe_ranks_.find({word[i], word[i + 1]});
            if (it != bpe_ranks_.end() && it->second < best_rank) {
                best_rank = it->second;
                best_idx = static_cast<int>(i);
            }
        }
        if (best_idx == -1) break;  // No more merges possible
        // Merge the pair at best_idx
        std::string merged = word[best_idx] + word[best_idx + 1];
        std::vector<std::string> new_word;
        for (size_t i = 0; i < word.size(); ++i) {
            if (static_cast<int>(i) == best_idx) {
                new_word.push_back(merged);
                ++i;  // Skip next element
            } else {
                new_word.push_back(word[i]);
            }
        }
        word = new_word;
    }
    return word;
 }
 std::vector<std::string> CLIPTokenizer::pre_tokenize(const std::string& text) const
 {
    // CLIP tokenizer: lowercase + basic clean + split by pattern
    // Pattern from CLIP: <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+
    // Simplified version for ASCII-dominant text:
    std::string cleaned;
    // Lowercase and basic whitespace normalization
    for (char c : text) {
        if (c >= 'A' && c <= 'Z') {
            cleaned += (c - 'A' + 'a');
        } else {
            cleaned += c;
        }
    }
    // Simple tokenization: split by whitespace and punctuation
    std::vector<std::string> words;
    std::string current;
    for (size_t i = 0; i < cleaned.size(); ++i) {
        char c = cleaned[i];
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
            // Whitespace only separates words here; the end-of-word marker
            // "</w>" is appended later in bpe().
        } else {
            // Check if punctuation should be separate token
            bool is_alpha_or_digit = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
            bool cur_is_alpha = !current.empty() &&
                ((current.back() >= 'a' && current.back() <= 'z') ||
                 (current.back() >= '0' && current.back() <= '9'));
            if (!current.empty() && !is_alpha_or_digit && cur_is_alpha) {
                // Start new token for punctuation
                words.push_back(current);
                current.clear();
            } else if (!current.empty() && is_alpha_or_digit && !cur_is_alpha) {
                words.push_back(current);
                current.clear();
            }
            current += c;
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
 }
 // ========== Encode ==========
 std::vector<int64_t> CLIPTokenizer::encode(const std::string& text, int max_len) const
 {
    if (!loaded_) {
        std::cerr << "Tokenizer not loaded!" << std::endl;
        return std::vector<int64_t>(max_len, 0);
    }
    std::vector<int64_t> tokens;
    // Add start-of-text token
    tokens.push_back(sot_token_id_);
    // Pre-tokenize
    std::vector<std::string> words = pre_tokenize(text);
    // Process each word
    for (const auto& word : words) {
        // Convert raw bytes to unicode representation
        std::string unicode_word = bytes_to_unicode_str(word);
        // Apply BPE
        std::vector<std::string> bpe_tokens = bpe(unicode_word);
        // Look up token IDs
        for (const auto& bt : bpe_tokens) {
            auto it = token_to_id_.find(bt);
            if (it != token_to_id_.end()) {
                tokens.push_back(it->second);
            } else {
                // Unknown token, try without </w>
                std::string no_ew = bt;
                if (no_ew.size() >= 4 && no_ew.substr(no_ew.size() - 4) == "</w>") {
                    no_ew = no_ew.substr(0, no_ew.size() - 4);
                }
                auto it2 = token_to_id_.find(no_ew);
                if (it2 != token_to_id_.end()) {
                    tokens.push_back(it2->second);
                }
                // else: skip unknown token
            }
        }
    }
    // Add end-of-text token
    tokens.push_back(eot_token_id_);
    // Truncate if necessary
    if (static_cast<int>(tokens.size()) > max_len) {
        tokens.resize(max_len);
        // Ensure EOT is at the end
        tokens.back() = eot_token_id_;
    }
    // Pad to max_len with EOT token (consistent with HuggingFace CLIPTokenizer)
    while (static_cast<int>(tokens.size()) < max_len) {
        tokens.push_back(eot_token_id_);
    }
    return tokens;
 }
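The framing, truncation, and padding at the end of `encode()` can be looked at in isolation. The sketch below mirrors those steps on a plain vector of IDs; `frame_and_pad` is a hypothetical helper introduced only for illustration, and the guard for a non-positive `max_len` is an addition of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: wrap ids in start/end tokens, then force the result
// to exactly max_len entries, the same three steps encode() applies.
std::vector<int64_t> frame_and_pad(const std::vector<int64_t>& ids, int max_len,
                                   int64_t sot, int64_t eot) {
    if (max_len <= 0) return {};  // guard added for this sketch only
    std::vector<int64_t> tokens;
    tokens.reserve(ids.size() + 2);
    tokens.push_back(sot);
    tokens.insert(tokens.end(), ids.begin(), ids.end());
    tokens.push_back(eot);
    const std::size_t target = static_cast<std::size_t>(max_len);
    if (tokens.size() > target) {
        tokens.resize(target);
        tokens.back() = eot;  // keep the end token after truncation
    }
    tokens.resize(target, eot);  // pad short sequences with the end token
    return tokens;
}
```

For example, with the default special token IDs from `clip_tokenizer.h`, `frame_and_pad({1, 2, 3}, 6, 49406, 49407)` gives `{49406, 1, 2, 3, 49407, 49407}`, and a `max_len` of 4 applied to five IDs gives `{49406, 1, 2, 49407}`.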
--- a/examples/clip/cpp/src/clip_tokenizer.h
+++ b/examples/clip/cpp/src/clip_tokenizer.h
@ -0,0 +1,105 @@
 /*
 * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 #ifndef CLIP_TOKENIZER_H
 #define CLIP_TOKENIZER_H
 #include <cstdint>
 #include <string>
 #include <vector>
 #include <map>
 #include <unordered_map>
 class CLIPTokenizer {
 public:
    CLIPTokenizer() = default;
    /**
     * Load tokenizer from vocab.json and merges.txt
     * @param vocab_path   Path to vocab.json
     * @param merges_path  Path to merges.txt
     * @return true on success
     */
    bool load(const std::string& vocab_path, const std::string& merges_path);
    /**
     * Load tokenizer from a directory containing vocab.json and merges.txt
     * @param tokenizer_dir  Path to directory
     * @return true on success
     */
    bool load_from_dir(const std::string& tokenizer_dir);
    /**
     * Tokenize text to token IDs with padding/truncation.
     * Adds <|startoftext|> and <|endoftext|> automatically.
     *
     * @param text      Input text string
     * @param max_len   Maximum sequence length (default: 64)
     * @return Vector of int64_t token IDs with shape [max_len]
     */
    std::vector<int64_t> encode(const std::string& text, int max_len = 64) const;
    /**
     * Check if tokenizer is loaded
     */
    bool is_loaded() const { return loaded_; }
    /**
     * Get vocabulary size
     */
    size_t vocab_size() const { return token_to_id_.size(); }
 private:
    // BPE pair
    using BPEPair = std::pair<std::string, std::string>;
    // Byte-to-unicode mapping (GPT-2 style)
    std::unordered_map<uint8_t, char32_t> byte_to_unicode_;
    std::unordered_map<char32_t, uint8_t> unicode_to_byte_;
    // Vocabulary
    std::unordered_map<std::string, int> token_to_id_;
    std::unordered_map<int, std::string> id_to_token_;
    // BPE merge rules (pair -> priority rank)
    std::map<BPEPair, int> bpe_ranks_;
    // Special token IDs
    int sot_token_id_ = 49406;  // <|startoftext|>
    int eot_token_id_ = 49407;  // <|endoftext|>
    bool loaded_ = false;
    // Initialize byte-to-unicode mapping
    void init_byte_to_unicode();
    // Convert UTF-8 string to vector of unicode codepoints
    static std::vector<char32_t> utf8_to_codepoints(const std::string& str);
    // Convert unicode codepoints to UTF-8 string
    static std::string codepoints_to_utf8(const std::vector<char32_t>& cps);
    // Apply BPE to a single word (already converted to unicode representation)
    std::vector<std::string> bpe(const std::string& token) const;
    // Lowercase and split text (simplified, ASCII-oriented approximation of CLIP's regex pattern)
    std::vector<std::string> pre_tokenize(const std::string& text) const;
    // Convert raw bytes to unicode string using byte_to_unicode mapping
    std::string bytes_to_unicode_str(const std::string& raw) const;
 };
 #endif // CLIP_TOKENIZER_H
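To make the role of `bpe_ranks_` concrete, the sketch below runs a rank-driven merge loop on a toy rank table. It assumes the same strategy as `bpe()` in `clip_tokenizer.cpp`: on each pass, merge the one adjacent pair with the lowest rank. `merge_by_rank` and the toy ranks are hypothetical and do not come from a real `merges.txt`.

```cpp
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::string>;

// Hypothetical helper: repeatedly merge the adjacent pair with the lowest
// rank until no adjacent pair appears in the rank table.
std::vector<std::string> merge_by_rank(std::vector<std::string> word,
                                       const std::map<Pair, int>& ranks) {
    while (word.size() >= 2) {
        int best_rank = INT_MAX;
        std::size_t best_idx = word.size();  // sentinel: no mergeable pair
        for (std::size_t i = 0; i + 1 < word.size(); ++i) {
            auto it = ranks.find(Pair(word[i], word[i + 1]));
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_idx = i;
            }
        }
        if (best_idx == word.size()) break;
        word[best_idx] += word[best_idx + 1];
        word.erase(word.begin() + static_cast<std::ptrdiff_t>(best_idx) + 1);
    }
    return word;
}
```

With toy ranks `{("b","u"): 0, ("bu","s</w>"): 1}`, the pieces `b`, `u`, `s</w>` collapse first to `bu`, `s</w>` and then to the single token `bus</w>`. Pieces with no matching pair are left unchanged.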
--- a/examples/clip/cpp/src/main.cpp
+++ b/examples/clip/cpp/src/main.cpp
@ -15,22 +15,26 @@
 */
 #include <iostream>
 #include <fstream>
 #include <sstream>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <vector>
 #include <string>
 #include <algorithm>
-#include "model_invoke.h"
+#include "clip_process.h"
 #include "clip_tokenizer.h"
 #define BILLION 1000000000
-struct Get_Times
+struct ProfilingTimer
 {
-    uint64_t init_start_time, init_end_time, init_total_time;
+    uint64_t init_start, init_end;
-    uint64_t preProcess_start_time, preProcess_end_time, preProcess_total_time;
+    uint64_t preprocess_start, preprocess_end;
-    uint64_t invoke_start_time, invoke_end_time, invoke_total_time;
+    uint64_t vision_infer_start, vision_infer_end;
-    uint64_t postProcess_start_time, postProcess_end_time, postProcess_total_time;
+    uint64_t text_infer_start, text_infer_end;
    uint64_t total_time;
    std::vector<uint64_t> total_time_group;
 };
 static uint64_t get_time_count()
@ -40,70 +44,288 @@ static uint64_t get_time_count()
    return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION);
 }
 // Default text prompts for demo
 static std::vector<std::string> default_texts = {
    "a red handbag",
    "a blue jacket",
    "a red bus"
 };
 // Parse comma-separated texts
 std::vector<std::string> parse_texts(const std::string& input)
 {
    std::vector<std::string> result;
    std::stringstream ss(input);
    std::string item;
    while (std::getline(ss, item, ',')) {
        // Trim whitespace
        size_t start = item.find_first_not_of(" \t");
        size_t end = item.find_last_not_of(" \t");
        if (start != std::string::npos && end != std::string::npos) {
            result.push_back(item.substr(start, end - start + 1));
        }
    }
    return result;
 }
 void print_usage(const char* prog_name)
 {
    printf("Usage: %s <vision_model> <text_model> <tokenizer_dir> [--profiling]\n", prog_name);
    printf("\n");
    printf("Arguments:\n");
    printf("  vision_model:   Path to vision model (.adla)\n");
    printf("  text_model:     Path to text model (.adla)\n");
    printf("  tokenizer_dir:  Path to directory containing vocab.json and merges.txt\n");
    printf("  --profiling:    Enable performance profiling output (optional)\n");
    printf("\n");
    printf("Interactive mode:\n");
    printf("  - Enter image path to process\n");
    printf("  - Enter comma-separated texts to compare (or 'skip' for defaults)\n");
    printf("  - Enter 'exit' to quit\n");
 }
 int main(int argc, char ** argv)
 {
-    Get_Times model_time;
+    ProfilingTimer timer = {};
    std::vector<float> input_data_fir;
    float* model_output_data;
    int ret = 0;
-    int max_index = 0;
+    bool profiling = false;
    if (argc < 2) {
        printf("Usage: %s <model_path> [base_dir] [json_filename]\n", argv[0]);
        printf("  model_path:   Path to the model file\n");
        printf("  base_dir:     Base directory for clip datasets (optional, can also use CLIP_BASE_DIR env var)\n");
        printf("  json_filename: JSON filename in each dataset folder (optional, can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)\n");
        return -1;
    }
    char* model_path_encoder = argv[1];
    std::string base_dir = (argc >= 3) ? argv[2] : "";
    std::string json_filename = (argc >= 4) ? argv[3] : "";
    void *context_model = NULL;
-    model_time.init_start_time = get_time_count();
+    if (argc < 4) {
-    context_model = init_network_file(model_path_encoder);
+        print_usage(argv[0]);
    model_time.init_end_time = get_time_count();
    if (context_model == NULL)
    {
        printf("init_network [context_model] fail.\n");
        return -1;
    }
-    if (getenv("GET_TIME"))
+    const char* vision_model_path = argv[1];
-    {
+    const char* text_model_path = argv[2];
-        model_time.init_total_time = (model_time.init_end_time - model_time.init_start_time) / 1000000;
+    const char* tokenizer_dir = argv[3];
-        std::cout << "init_model_total time : " << model_time.init_total_time << "ms" << std::endl;
+
    // Check for --profiling flag
    for (int i = 4; i < argc; ++i) {
        if (std::string(argv[i]) == "--profiling") {
            profiling = true;
        }
    }
-    while (true)
+    const float logit_scale = 100.0f;
-    {
+    const int max_seq_len = 64;
        std::string json_path;
-        printf("\nPlease enter the JPG image path (enter exit to quit):\n");
+    // Load tokenizer
-        std::getline(std::cin, json_path);
+    printf("[Info] Loading tokenizer from: %s\n", tokenizer_dir);
-        if (json_path == "exit") break;
+    CLIPTokenizer tokenizer;
-        if (json_path.empty()) {
+    if (!tokenizer.load_from_dir(tokenizer_dir)) {
-            printf("The path cannot be empty.\n");
+        printf("[Error] Failed to load tokenizer.\n");
        return -1;
    }
    // Initialize models
    printf("[Info] Initializing vision model: %s\n", vision_model_path);
    timer.init_start = get_time_count();
    void* vision_context = init_network_file(vision_model_path);
    if (vision_context == NULL) {
        printf("[Error] Failed to initialize vision model.\n");
        return -1;
    }
    printf("[Info] Initializing text model: %s\n", text_model_path);
    void* text_context = init_network_file(text_model_path);
    if (text_context == NULL) {
        printf("[Error] Failed to initialize text model.\n");
        destroy_network(vision_context);
        return -1;
    }
    timer.init_end = get_time_count();
    if (profiling) {
        uint64_t init_time = (timer.init_end - timer.init_start) / 1000000;
        printf("[Profiling] Model initialization: %llums\n", (unsigned long long)init_time);
    }
    printf("[Info] Models initialized successfully.\n\n");
    // Interactive loop
    while (true) {
        std::string image_path;
        printf("============================================================\n");
        printf("[Info] Image Path (or 'exit' to quit):\n");
        std::getline(std::cin, image_path);
        // Trim whitespace
        size_t start = image_path.find_first_not_of(" \t\r\n");
        size_t end = image_path.find_last_not_of(" \t\r\n");
        if (start != std::string::npos && end != std::string::npos) {
            image_path = image_path.substr(start, end - start + 1);
        } else {
            image_path.clear();
        }
        if (image_path == "exit") {
            printf("[Info] Exiting...\n");
            break;
        }
        if (image_path.empty()) {
            printf("[Warning] Please enter an image path.\n");
            continue;
        }
        std::vector<std::string> out_str_path = process_image_dir(context_model, json_path, base_dir, json_filename);
-        for (int i = 0; i < out_str_path.size(); i++)
+        // Check if file exists
        {
-            std::cout << "Index[" << i << "] : " << out_str_path[i] << std::endl;
+            std::ifstream img_file(image_path);
            if (!img_file.good()) {
                printf("[Error] Image not found: %s\n", image_path.c_str());
                continue;
            }
        }
        // Get texts to compare
        std::vector<std::string> texts;
        printf("[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):\n");
        std::string text_input;
        std::getline(std::cin, text_input);
        // Trim
        start = text_input.find_first_not_of(" \t\r\n");
        end = text_input.find_last_not_of(" \t\r\n");
        if (start != std::string::npos && end != std::string::npos) {
            text_input = text_input.substr(start, end - start + 1);
        } else {
            text_input.clear();
        }
        if (text_input.empty() || text_input == "skip") {
            texts = default_texts;
            printf("[Info] Using default texts\n");
        } else {
            texts = parse_texts(text_input);
        }
        if (texts.empty()) {
            printf("[Warning] No texts provided.\n");
            continue;
        }
        // ==================== Process Image ====================
        printf("\n[Info] Processing image: %s\n", image_path.c_str());
        timer.preprocess_start = get_time_count();
        std::vector<float> image_input = preprocess_image(image_path);
        if (image_input.empty()) {
            printf("[Error] Failed to preprocess image.\n");
            continue;
        }
        timer.preprocess_end = get_time_count();
        // Run vision model
        timer.vision_infer_start = get_time_count();
        std::vector<float> image_embedding = run_vision_model(vision_context, image_input);
        if (image_embedding.empty()) {
            printf("[Error] Vision model inference failed.\n");
            continue;
        }
        timer.vision_infer_end = get_time_count();
        // L2 normalize image embedding
        image_embedding = l2_normalize(image_embedding);
        printf("[Info] Image embedding size: %zu\n", image_embedding.size());
        // ==================== Process Texts ====================
        printf("[Info] Processing %zu text(s)...\n", texts.size());
        std::vector<std::vector<float>> text_embeddings;
        std::vector<uint64_t> text_infer_times;
        timer.text_infer_start = get_time_count();
        for (size_t i = 0; i < texts.size(); ++i) {
            // Tokenize text
            std::vector<int64_t> token_ids = tokenizer.encode(texts[i], max_seq_len);
            // Run text model
            uint64_t t_start = get_time_count();
            std::vector<float> text_emb = run_text_model(text_context, token_ids);
            uint64_t t_end = get_time_count();
            text_infer_times.push_back((t_end - t_start) / 1000000);
            if (text_emb.empty()) {
                printf("[Error] Text model inference failed for: %s\n", texts[i].c_str());
                continue;
            }
            // L2 normalize
            text_emb = l2_normalize(text_emb);
            text_embeddings.push_back(text_emb);
        }
        timer.text_infer_end = get_time_count();
        if (text_embeddings.size() != texts.size()) {
            printf("[Error] Some text embeddings failed.\n");
            continue;
        }
        printf("[Info] Text embeddings size: %zu x %zu\n", text_embeddings.size(), 
               text_embeddings.empty() ? 0 : text_embeddings[0].size());
        // ==================== Compute Similarity ====================
        std::vector<float> similarities(texts.size());
        std::vector<float> logits(texts.size());
        for (size_t i = 0; i < texts.size(); ++i) {
            similarities[i] = compute_similarity(image_embedding, text_embeddings[i], 1.0f);  // cosine sim
            logits[i] = similarities[i] * logit_scale;
        }
        // Compute probabilities
        std::vector<float> probs = softmax(logits);
        // Sort by probability (descending)
        std::vector<size_t> indices(texts.size());
        for (size_t i = 0; i < texts.size(); ++i) indices[i] = i;
        std::sort(indices.begin(), indices.end(),
            [&probs](size_t a, size_t b) { return probs[a] > probs[b]; });
        // ==================== Print Results ====================
        printf("\n============================================================\n");
        printf("CLIP Image-Text Matching Results\n");
        printf("============================================================\n");
        printf("Image: %s\n", image_path.c_str());
        printf("logit_scale: %.6f\n", logit_scale);
        printf("------------------------------------------------------------\n");
        for (size_t rank = 0; rank < indices.size(); ++rank) {
            size_t i = indices[rank];
            printf("[%zu] prob=%.6f  sim=%.6f  text='%s'\n",
                rank + 1, probs[i], similarities[i], texts[i].c_str());
        }
        printf("============================================================\n");
        if (profiling) {
            uint64_t preprocess_time = (timer.preprocess_end - timer.preprocess_start) / 1000000;
            uint64_t vision_time = (timer.vision_infer_end - timer.vision_infer_start) / 1000000;
            uint64_t text_total_time = (timer.text_infer_end - timer.text_infer_start) / 1000000;
            printf("\n[Profiling]\n");
            // uint64_t is not 'unsigned long' on 32-bit targets, so cast explicitly for printf.
            printf("  Image preprocess:  %llums\n", (unsigned long long)preprocess_time);
            printf("  Vision inference:  %llums\n", (unsigned long long)vision_time);
            for (size_t i = 0; i < texts.size() && i < text_infer_times.size(); ++i) {
                printf("  Text inference[%zu]: %llums  '%s'\n", i, (unsigned long long)text_infer_times[i], texts[i].c_str());
            }
            printf("  Text total:        %llums (%zu texts)\n", (unsigned long long)text_total_time, texts.size());
        }
        printf("\n");
    }
-    ret = destroy_network(context_model);
+    // Cleanup
-    if (ret != 0)
+    ret = destroy_network(vision_context);
-    {
+    if (ret != 0) {
-        printf("destroy_network [context_model] fail.\n");
+        printf("[Error] Failed to destroy vision model.\n");
        return -1;
    }
-    return ret;
+    ret = destroy_network(text_context);
-}
+    if (ret != 0) {
        printf("[Error] Failed to destroy text model.\n");
    }
    printf("[Info] Done.\n");
    return 0;
 }
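The comma-separated prompt handling in `main.cpp` can be exercised without the models. The following is a minimal sketch, not the demo's code: `split_prompts` is a hypothetical stand-in that follows the same split-and-trim approach as `parse_texts`, and `split_prompts_example` only illustrates the expected behaviour for padded and empty items.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for parse_texts: split on commas, trim spaces and
// tabs, and drop items that are empty after trimming.
std::vector<std::string> split_prompts(const std::string& input)
{
    std::vector<std::string> result;
    std::stringstream ss(input);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const std::size_t start = item.find_first_not_of(" \t");
        if (start == std::string::npos) {
            continue;  // item was empty or whitespace only
        }
        const std::size_t end = item.find_last_not_of(" \t");
        result.push_back(item.substr(start, end - start + 1));
    }
    return result;
}

// Illustrative check: surrounding whitespace is removed and the empty item
// between the two adjacent commas is skipped.
void split_prompts_example()
{
    const std::vector<std::string> prompts =
        split_prompts(" a red handbag, a blue jacket ,, a red bus ");
    assert(prompts.size() == 3);
    assert(prompts[1] == "a blue jacket");
}
```

Because whitespace-only items are dropped, an input such as `" , ,"` yields an empty list, which is the case the interactive loop reports as "No texts provided".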
--- a/examples/clip/cpp/src/model_invoke.cpp
+++ b/examples/clip/cpp/src/model_invoke.cpp
@ -20,31 +20,20 @@
 #include <fstream>
 #include <algorithm>
 #include <vector>
 #include <cmath>
 #include <cstdlib>
-#include "model_invoke.h"
+#include "clip_process.h"
 #include "nn_sdk.h"
 #include "json.hpp"
 #include <filesystem>
 #include <regex>
-using json = nlohmann::ordered_json;
+// Global DMA config for models
-namespace fs = std::__fs::filesystem;
+static aml_memory_config_t vision_mem_config;
 static aml_memory_data_t vision_mem_data;
 static void* vision_context_flag = nullptr;
-struct DMAConfig {
+static aml_memory_config_t text_mem_config;
-    bool use_dma = true;
+static aml_memory_data_t text_mem_data;
-    bool malloc_buffer_once = true;
+static void* text_context_flag = nullptr;
 };
 DMAConfig context_model;
 ///////////////////////////////////////////////////////////
 aml_memory_config_t mem_config_context_model;
 aml_memory_data_t mem_data_context_model;
 std::vector<float> preprocess_image(const std::string& image_path);
 float post_process(const float* a, const std::vector<float>& b);
 void* init_network_file(const char *model_path)
 {
@ -95,202 +84,119 @@ void* init_network_file(const char *model_path)
    return qcontext;
 }
-float* run_network(void *qcontext, std::vector<float> input_ids, const std::string image_type)
+std::vector<float> run_vision_model(void* qcontext, const std::vector<float>& input_data)
 {
    int ret = 0;
    nn_input inData;
    nn_output *outdata = NULL;
    aml_output_config_t outconfig;
    inData.input_index = 0;
    inData.info.input_format = AML_INPUT_DEFAULT;
-    inData.size = input_ids.size() * sizeof(float);
+    inData.size = input_data.size() * sizeof(float);
-    if (context_model.use_dma) {
+    // Use DMA
-        if (context_model.malloc_buffer_once) {
+    if (!vision_context_flag) {
-            mem_config_context_model.cache_type = AML_WITH_CACHE;
+        vision_mem_config.cache_type = AML_WITH_CACHE;
-            mem_config_context_model.memory_type = AML_VIRTUAL_ADDR;
+        vision_mem_config.memory_type = AML_VIRTUAL_ADDR;
-            mem_config_context_model.direction = AML_MEM_DIRECTION_READ_WRITE;
+        vision_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
-            mem_config_context_model.index = 0;
+        vision_mem_config.index = 0;
-            mem_config_context_model.mem_size = inData.size;
+        vision_mem_config.mem_size = inData.size;
-            aml_util_mallocBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
+        aml_util_mallocBuffer(qcontext, &vision_mem_config, &vision_mem_data);
-            aml_util_swapExternalInputBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
+        aml_util_swapExternalInputBuffer(qcontext, &vision_mem_config, &vision_mem_data);
-        }
+        vision_context_flag = qcontext;
        inData.input_type = INPUT_DMA_DATA;
        memcpy(mem_data_context_model.viraddr, input_ids.data(), mem_config_context_model.mem_size);
        inData.input = NULL;
    } else {
        inData.input = reinterpret_cast<unsigned char*>(input_ids.data());
        inData.input_type = BINARY_RAW_DATA;
        ret = aml_module_input_set(qcontext, &inData);
        if (ret)
        {
            printf("aml_module_input_set fail.\n");
        }
    }
-    context_model.malloc_buffer_once = false;
+
    inData.input_type = INPUT_DMA_DATA;
    memcpy(vision_mem_data.viraddr, input_data.data(), vision_mem_config.mem_size);
    inData.input = NULL;
    memset(&outconfig, 0, sizeof(aml_output_config_t));
-
+    outconfig.format = AML_OUTDATA_DMA;
    if (context_model.use_dma) {
        outconfig.format = AML_OUTDATA_DMA;
    } else {
        outconfig.format = AML_OUTDATA_RAW;
    }
    outconfig.typeSize = sizeof(aml_output_config_t);
    outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
-    return reinterpret_cast<float*>(outdata->out[0].buf);
+    if (outdata == NULL || outdata->out[0].buf == NULL) {
-}
+        printf("Vision model inference failed.\n");
-
+        return {};
 int extract_index(const std::string& filename) {
    std::regex pattern(R"(test_\w+_(\d+)\.jpg)");
    std::smatch match;
    if (std::regex_match(filename, match, pattern)) {
        return std::stoi(match[1]);
    }
-    return -1;
+
    // Copy output to vector
    size_t output_size = outdata->out[0].size / sizeof(float);
    float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
    std::vector<float> result(output_ptr, output_ptr + output_size);
    return result;
 }
-std::vector<std::string> process_image_dir(
+std::vector<float> run_text_model(void* qcontext, const std::vector<int64_t>& input_ids)
    void* context_model,
    const std::string& image_dir_path,
    const std::string& base_dir,
    const std::string& json_filename)
 {
-    std::vector<std::string> results;
+    int ret = 0;
-    std::regex file_pattern(R"(test_(\w+)_\d+\.jpg)");
+    nn_input inData;
-    
+    nn_output *outdata = NULL;
-    // Get base_dir from parameter, environment variable, or use default
+    aml_output_config_t outconfig;
-    std::string actual_base_dir = base_dir;
+
-    if (actual_base_dir.empty()) {
+    inData.input_index = 0;
-        const char* env_base_dir = std::getenv("CLIP_BASE_DIR");
+    inData.info.input_format = AML_INPUT_DEFAULT;
-        if (env_base_dir != nullptr) {
+    inData.size = input_ids.size() * sizeof(int64_t);
-            actual_base_dir = env_base_dir;
+
-        } else {
+    // Use DMA
-            actual_base_dir = "./demo_data/clip_datasets/";
+    if (!text_context_flag) {
-        }
+        text_mem_config.cache_type = AML_WITH_CACHE;
-    }
+        text_mem_config.memory_type = AML_VIRTUAL_ADDR;
-    
+        text_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
-    // Ensure base_dir ends with '/'
+        text_mem_config.index = 0;
-    if (!actual_base_dir.empty() && actual_base_dir.back() != '/') {
+        text_mem_config.mem_size = inData.size;
-        actual_base_dir += "/";
+        aml_util_mallocBuffer(qcontext, &text_mem_config, &text_mem_data);
-    }
+        aml_util_swapExternalInputBuffer(qcontext, &text_mem_config, &text_mem_data);
-    
+        text_context_flag = qcontext;
    // Get json_filename from parameter, environment variable, or use default
    std::string actual_json_filename = json_filename;
    if (actual_json_filename.empty()) {
        const char* env_json_filename = std::getenv("CLIP_JSON_FILENAME");
        if (env_json_filename != nullptr) {
            actual_json_filename = env_json_filename;
        } else {
            actual_json_filename = "clip_text_res.json";
        }
    }
-    // storing qualified paths
+    inData.input_type = INPUT_DMA_DATA;
-    std::vector<fs::directory_entry> matched_files;
+    memcpy(text_mem_data.viraddr, input_ids.data(), text_mem_config.mem_size);
    inData.input = NULL;
-    // collect all relevant img.
+    memset(&outconfig, 0, sizeof(aml_output_config_t));
-    for (const auto& entry : fs::directory_iterator(image_dir_path)) {
+    outconfig.format = AML_OUTDATA_DMA;
-        if (!entry.is_regular_file()) continue;
+    outconfig.typeSize = sizeof(aml_output_config_t);
    outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
-        std::string filename = entry.path().filename().string();
+    if (outdata == NULL || outdata->out[0].buf == NULL) {
-        if (std::regex_match(filename, file_pattern)) {
+        printf("Text model inference failed.\n");
-            matched_files.push_back(entry);
+        return {};
        }
    }
-    // use index sort, test_type_index.jpg
+    // Copy output to vector
-    std::sort(matched_files.begin(), matched_files.end(),
+    size_t output_size = outdata->out[0].size / sizeof(float);
-        [](const fs::directory_entry& a, const fs::directory_entry& b) {
+    float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
-            return extract_index(a.path().filename().string()) <
+    std::vector<float> result(output_ptr, output_ptr + output_size);
                   extract_index(b.path().filename().string());
        });
-    for (const auto& entry : matched_files) {
+    return result;
        if (!entry.is_regular_file()) continue;
        std::string filename = entry.path().filename().string();
        std::smatch match;
        if (!std::regex_match(filename, match, file_pattern)) continue;
        std::string name = match[1];
        std::vector<float> input_data = preprocess_image(entry.path().string());
        float* model_output = run_network(context_model, input_data, name);
        float max_sim = -std::numeric_limits<float>::infinity();
        std::string best_key, best_id;
        // Iterate through all directories to find the directory containing the name
        for (const auto& dir_entry : fs::directory_iterator(actual_base_dir)) {
            if (!dir_entry.is_directory()) continue;
            std::string folder_name = dir_entry.path().filename().string();
            if (folder_name.find(name) == std::string::npos) continue;
            std::string vit_res_path = actual_base_dir + folder_name + "/" + actual_json_filename;
            std::ifstream vit_in(vit_res_path);
            if (!vit_in.is_open()) {
                printf("Failed to open: %s\n", vit_res_path.c_str());
                continue;
            }
            json vit_json;
            vit_in >> vit_json;
            for (auto it = vit_json.begin(); it != vit_json.end(); ++it) {
                const std::string& key = it.key();
                const std::vector<float> vec = it.value().get<std::vector<float>>();
                float sim = post_process(model_output, vec);
                // printf("sim: %.4f\n", sim);
                if (sim > max_sim) {
                    max_sim = sim;
                    best_key = key;
                    best_id = folder_name;
                }
            }
        }
        if (!best_key.empty() && !best_id.empty()) {
            std::string best_path = actual_base_dir + best_id + "/";
            results.push_back(best_path);
            printf("\nProcessing images: %s, datasets img path: %s\n", filename.c_str(), best_path.c_str());
            // printf("Most similar image: %s similarity: %.4f\n", best_path.c_str(), max_sim);    // for debug
        }
    }
    return results;
 }
 int destroy_network(void *qcontext)
 {
    int ret = 0;
-    /* free model 
+    if (vision_context_flag == qcontext) {
-       model.use_dma = true
+        printf("Free vision model memory.\n");
-       model.malloc_buffer_once = false
+        aml_util_freeBuffer(qcontext, &vision_mem_config, &vision_mem_data);
-    */
+        vision_context_flag = nullptr;
-    if (context_model.use_dma && mem_config_context_model.mem_size != 0) {
+    } else if (text_context_flag == qcontext) {
-        ret = aml_util_freeBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
+        printf("Free text model memory.\n");
-        if (ret)
+        aml_util_freeBuffer(qcontext, &text_mem_config, &text_mem_data);
-        {
+        text_context_flag = nullptr;
-            std::cout << "aml_util_freeBuffer fail." << std::endl;
+    } else {
-        }
+        printf("Free network failed: context not found.\n");
        return -1;
    }
    context_model.use_dma = false;
    ret = aml_module_destroy(qcontext);
    if (ret)
    {
-        printf("aml_module_destroy fail.\n");
+        printf("Free network failed: destroy failed.\n");
        return -1;
    }
    return ret;
-}
+}
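In `run_vision_model` and `run_text_model`, the input buffer is allocated on the first call with that call's input size, and later calls copy `mem_size` bytes from the new input. That is safe only while every input has the same byte size. The sketch below shows one way such a copy could be guarded. It assumes the caller knows the byte size the buffer was allocated with; `copy_input_checked` is a hypothetical helper and is not part of `nn_sdk`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not an nn_sdk API): copy `src` into a preallocated
// buffer only when its byte size matches the size the buffer was created with.
template <typename T>
bool copy_input_checked(void* dst, std::size_t dst_bytes, const std::vector<T>& src)
{
    const std::size_t src_bytes = src.size() * sizeof(T);
    if (dst == nullptr || src_bytes == 0 || src_bytes != dst_bytes) {
        return false;  // refuse to over-read src or leave dst partially stale
    }
    std::memcpy(dst, src.data(), src_bytes);
    return true;
}
```

A caller would pass the buffer's virtual address and its allocated size, then treat a `false` return as an inference failure instead of proceeding with a mismatched input.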
--- a/examples/clip/cpp/src/pre_postprocess.cpp
+++ b/examples/clip/cpp/src/pre_postprocess.cpp
@ -19,13 +19,13 @@
 #include <algorithm>
 #include <string>
 #include <iostream>
-#include "model_invoke.h"
+#include "clip_process.h"
 #define STB_IMAGE_IMPLEMENTATION
 #include "stb_image.h"
 // bilinear interpolation scaling
-std::vector<float> resize_bilinear(
+static std::vector<float> resize_bilinear(
    const unsigned char* src, int src_w, int src_h, int channels,
    int dst_w, int dst_h)
 {
@ -102,29 +102,29 @@ std::vector<float> preprocess_image(const std::string& image_path) {
        }
    }
-    // get NHWC
+    // Return NHWC format (batch dimension will be added in caller)
    return cropped;
 }
-float post_process(const float* a, const std::vector<float>& b) {
+// ==================== Post Processing ====================
-    float dot = 0.0f, scale = 100.00000762939453f;
+
-    for (size_t i = 0; i < b.size(); ++i) {
+std::vector<float> l2_normalize(const std::vector<float>& vec)
-        dot += a[i] * b[i];
+{
    float norm = 0.0f;
    for (float v : vec) {
        norm += v * v;
    }
-    dot *= scale;
+    norm = std::sqrt(norm) + 1e-12f;
-    return dot;
+
    std::vector<float> result(vec.size());
    for (size_t i = 0; i < vec.size(); ++i) {
        result[i] = vec[i] / norm;
    }
    return result;
 }
-float post_process(const int8_t* a, const std::vector<float>& b) {
+std::vector<float> softmax(const std::vector<float>& logits)
-    float dot = 0.0f, scale = 100.00000762939453f;
+{
    for (size_t i = 0; i < b.size(); ++i) {
        dot += (a[i] - 66) * b[i];
    }
    dot *= scale;
    return dot;
 }
 std::vector<float> softmax(const std::vector<float>& logits) {
    std::vector<float> result(logits.size());
    // numerical stability: subtract the maximum value first.
@ -142,3 +142,17 @@ std::vector<float> softmax(const std::vector<float>& logits) {
    return result;
 }
 float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale)
 {
    if (a.size() != b.size()) {
        printf("Feature dimension mismatch: %zu vs %zu\n", a.size(), b.size());
        return 0.0f;
    }
    float dot = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
    }
    return dot * scale;
 }
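The post-processing helpers above combine into the scoring step used by the demo: L2-normalize both embeddings, take the dot product (cosine similarity), multiply by the logit scale, and apply softmax across the candidate texts. The following is a self-contained sketch of that flow under stated assumptions: `unit_vector`, `dot_product`, `stable_softmax`, and `match_probabilities` are hypothetical names that mirror the roles of `l2_normalize`, `compute_similarity`, and `softmax`, and are not the demo's own implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scale a vector to unit length; the epsilon avoids division by zero.
std::vector<float> unit_vector(const std::vector<float>& v)
{
    float norm = 0.0f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm) + 1e-12f;
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) out[i] = v[i] / norm;
    return out;
}

// Dot product of two equally sized vectors.
float dot_product(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Softmax with the maximum subtracted first for numerical stability.
std::vector<float> stable_softmax(const std::vector<float>& logits)
{
    std::vector<float> out(logits.size());
    if (logits.empty()) return out;
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float total = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - max_logit);
        total += out[i];
    }
    for (float& p : out) p /= total;
    return out;
}

// Probability of each text matching the image: softmax(scale * cosine).
std::vector<float> match_probabilities(const std::vector<float>& image_emb,
                                       const std::vector<std::vector<float>>& text_embs,
                                       float logit_scale)
{
    const std::vector<float> img = unit_vector(image_emb);
    std::vector<float> logits;
    logits.reserve(text_embs.size());
    for (const auto& text_emb : text_embs) {
        logits.push_back(logit_scale * dot_product(img, unit_vector(text_emb)));
    }
    return stable_softmax(logits);
}
```

With a scale of 100, even small differences in cosine similarity produce a sharply peaked distribution, which is why the demo prints both the probability and the raw similarity for each text.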
--- a/examples/clip/py/clip.py
+++ b/examples/clip/py/clip.py
@ -1,304 +1,339 @@
-import numpy as np
+# -*- coding: utf-8 -*-
-import os
+"""
-import argparse
+Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
-import json
+
-import re
+Licensed under the Apache License, Version 2.0 (the "License");
-from PIL import Image
+you may not use this file except in compliance with the License.
-from amlnnlite.api import AMLNNLite
+You may obtain a copy of the License at
-
+
-
+    http://www.apache.org/licenses/LICENSE-2.0
-def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
+
-    """
+Unless required by applicable law or agreed to in writing, software
-    Preprocess image for CLIP model.
+distributed under the License is distributed on an "AS IS" BASIS,
-    
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-    Steps:
+See the License for the specific language governing permissions and
-        1. Load image and convert to RGB
+limitations under the License.
-        2. Scale the shorter side to target_size
+"""
-        3. Center crop to target_size x target_size
+
-        4. Normalize with CLIP mean and std
+# This inference script is designed for CLIP model using AMLNNLite.
-    
+
-    Args:
+import os
-        image_path (str): Path to input image
+import argparse
-        target_size (int): Target image size (default: 224)
+import numpy as np
-    
+from PIL import Image
-    Returns:
+from transformers import CLIPTokenizer
-        np.ndarray: Preprocessed image data with shape (target_size, target_size, 3)
+from amlnnlite.api import AMLNNLite
-    """
+
-    # Load image
+# ==================== Utility Functions ====================
-    img = Image.open(image_path).convert("RGB")
+
-    width, height = img.size
+def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
-    
+    """Compute softmax values for array x."""
-    # Scale the shorter side
+    x = x - np.max(x, axis=axis, keepdims=True)
-    scale = target_size / min(width, height)
+    e = np.exp(x)
-    new_w = int(round(width * scale))
+    return e / np.sum(e, axis=axis, keepdims=True)
-    new_h = int(round(height * scale))
+
-    
+
-    # Resize
+def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
-    img = img.resize((new_w, new_h), Image.BILINEAR)
+    """L2 normalize array x along specified axis."""
-    
+    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)
-    # Center crop
+
-    left = (new_w - target_size) // 2
+# ==================== Vision Preprocessing ====================
-    top = (new_h - target_size) // 2
+
-    img = img.crop((left, top, left + target_size, top + target_size))
+def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
-    
+    """
-    # Convert to numpy array and normalize to [0, 1]
+    Preprocess image for CLIP model.
-    img_array = np.array(img, dtype=np.float32) / 255.0
+
-    
+    Args:
-    # CLIP normalization
+        image_path (str): Path to input image
-    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
+        target_size (int): Target image size (default: 224)
-    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
+
-    
+    Returns:
-    # Normalize: (x - mean) / std
+        np.ndarray: Preprocessed image data with shape (1, target_size, target_size, 3) in NHWC format
-    img_array = (img_array - mean) / std
+    """
-    
+    image = Image.open(image_path).convert("RGB")
-    # Return in NHWC format
+    width, height = image.size
-    return img_array
+
-
+    # Scale the shorter side
-
+    scale = target_size / min(width, height)
-def post_process(
+    new_width = int(width * scale)
-    image_features: np.ndarray,
+    new_height = int(height * scale)
-    text_features: np.ndarray,
+    image_resized = image.resize((new_width, new_height), resample=Image.BICUBIC)
-    scale: float = 100.00000762939453,
+
-    use_cosine: bool = True,
+    # Center crop
-    apply_scale: bool = True,
+    left = (new_width - target_size) // 2
-) -> float:
+    top = (new_height - target_size) // 2
-    """
+    right = left + target_size
-    Calculate similarity between image and text features.
+    bottom = top + target_size
-    
+    image_cropped = image_resized.crop((left, top, right, bottom))
-    Args:
+
-        image_features (np.ndarray): Image feature vector
+    # Convert to numpy array and normalize to [0, 1]
-        text_features (np.ndarray): Text feature vector
+    image_np = np.array(image_cropped).astype(np.float32) / 255.0
-        scale (float): Scale factor for similarity calculation
+
-        use_cosine (bool): If True, L2-normalize both vectors before dot product (cosine similarity)
+    # CLIP normalization
-        apply_scale (bool): If True, multiply by scale after dot product
+    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
-    
+    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
-    Returns:
+    image_np = (image_np - mean) / std
-        float: Similarity score
+
-    """
+    # Add batch dimension: HWC -> NHWC
-    img_vec = image_features.flatten().astype(np.float32)
+    image_np = np.expand_dims(image_np, axis=0)
-    txt_vec = np.array(text_features, dtype=np.float32).flatten()
+
-    
+    return image_np.astype(np.float32)  # [1, 224, 224, 3]
-    if len(img_vec) != len(txt_vec):
+
-        raise ValueError(f"Feature dimension mismatch: image={len(img_vec)}, text={len(txt_vec)}")
+# ==================== Text Preprocessing ====================
-    
+
-    if use_cosine:
+def preprocess_text(tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
-        img_norm = np.linalg.norm(img_vec) + 1e-8
+    """
-        txt_norm = np.linalg.norm(txt_vec) + 1e-8
+    Preprocess text for CLIP model using CLIPTokenizer.
-        img_vec = img_vec / img_norm
+
-        txt_vec = txt_vec / txt_norm
+    Args:
-    
+        tokenizer: CLIPTokenizer instance
-    dot_product = np.dot(img_vec, txt_vec)
+        text (str): Input text string
-    
+        max_len (int): Maximum sequence length (default: 64)
-    similarity = dot_product * scale if apply_scale else dot_product
+
-    
+    Returns:
-    return float(similarity)
+        np.ndarray: Tokenized text with shape (1, max_len) as int64
-
+    """
-
+    enc = tokenizer(
-def extract_index(filename: str) -> int:
+        text,
-    """
+        padding="max_length",
-    Extract index from filename pattern: test_xxx_index.jpg
+        truncation=True,
-    
+        max_length=max_len,
-    Args:
+        return_tensors="np",
-        filename (str): Filename to extract index from
+    )
-    
+    # text model input: int64[1, max_len]
-    Returns:
+    input_ids = enc["input_ids"].astype(np.int64)
-        int: Extracted index, or -1 if pattern doesn't match
+    return input_ids
-    """
+
-    pattern = r"test_\w+_(\d+)\.jpg"
+# ==================== Model Inference ====================
-    match = re.match(pattern, filename)
+
-    if match:
+def compute_image_embedding(vision_amlnn: AMLNNLite, image_path: str) -> np.ndarray:
-        return int(match.group(1))
+    """
-    return -1
+    Compute image embedding using vision model.
-
+
-
+    Args:
-def process_image_dir(
+        vision_amlnn: AMLNNLite instance for vision model
-    amlnn: AMLNNLite,
+        image_path (str): Path to input image
-    image_dir_path: str,
+
-    base_dir: str = "",
+    Returns:
-    json_filename: str = ""
+        np.ndarray: L2-normalized image embedding with shape (1, embed_dim)
-) -> list:
+    """
-    """
+    input_data = preprocess_image(image_path)  # [1, 224, 224, 3]
-    Process image directory and find best matching text dataset.
+
-    
+    outputs = vision_amlnn.inference(
-    Args:
+        inputs=[input_data],
-        amlnn: AMLNNLite instance
+        inputs_data_format='NHWC',
-        image_dir_path (str): Path to directory containing test images
+        outputs_data_format='NHWC'
-        base_dir (str): Base directory for clip datasets (optional, can use CLIP_BASE_DIR env var)
+    )
-        json_filename (str): JSON filename in each dataset folder (optional, can use CLIP_JSON_FILENAME env var)
+
-    
+    feats = outputs[0].astype(np.float32)
-    Returns:
+    feats = feats.reshape(1, -1)  # Flatten to [1, embed_dim]
-        list: List of best matching dataset paths
+    return l2_normalize(feats, axis=1)
-    """
+
-    results = []
+def compute_text_embedding(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
-    file_pattern = re.compile(r"test_(\w+)_\d+\.jpg")
+    """
-    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.JPG', '.JPEG', '.PNG', '.BMP'}
+    Compute text embedding using text model.
-    
+
-    if not base_dir:
+    Args:
-        base_dir = os.getenv("CLIP_BASE_DIR", "./clip_datasets/")
+        text_amlnn: AMLNNLite instance for text model
-    
+        tokenizer: CLIPTokenizer instance
-    if not json_filename:
+        text (str): Input text string
-        json_filename = os.getenv("CLIP_JSON_FILENAME", "clip_text_res.json")
+        max_len (int): Maximum sequence length
-    
+
-    matched_files = []
+    Returns:
-    if os.path.isdir(image_dir_path):
+        np.ndarray: L2-normalized text embedding with shape (1, embed_dim)
-        for filename in os.listdir(image_dir_path):
+    """
-            filepath = os.path.join(image_dir_path, filename)
+    input_ids = preprocess_text(tokenizer, text, max_len)  # [1, max_len]
-            if os.path.isfile(filepath):
+    print(f"input_ids: {input_ids}")
-                if file_pattern.match(filename):
+
-                    matched_files.append((filename, filepath, True))  
+    # AMLNNLite requires 4D input, reshape to (1, 1, 1, max_len)
-                elif any(filename.lower().endswith(ext) for ext in image_extensions):
+    input_ids_4d = input_ids[:, None, None, :]  # [1, 1, 1, max_len]
-                    matched_files.append((filename, filepath, False))  
+
-    elif os.path.isfile(image_dir_path):
+    outputs = text_amlnn.inference(
-        filename = os.path.basename(image_dir_path)
+        inputs=[input_ids_4d],
-        if any(filename.lower().endswith(ext) for ext in image_extensions):
+        inputs_data_format='NHWC',
-            has_pattern = bool(file_pattern.match(filename))
+        outputs_data_format='NHWC'
-            matched_files.append((filename, image_dir_path, has_pattern))
+    )
-        else:
+
-            print(f"Error: {image_dir_path} is not a valid image file")
+    feats = outputs[0].astype(np.float32)
-            return results
+    feats = feats.reshape(1, -1)  # Flatten to [1, embed_dim]
-    else:
+    return l2_normalize(feats, axis=1)
-        print(f"Error: {image_dir_path} is not a valid directory or file")
+
-        return results
+def compute_text_embeddings_batch(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, texts: list, max_len: int = 64) -> np.ndarray:
-    
+    """
-    if not matched_files:
+    Compute text embeddings for multiple texts.
-        print(f"Warning: No image files found in {image_dir_path}")
+
-        return results
+    Args:
-    
+        text_amlnn: AMLNNLite instance for text model
-    print(f"Found {len(matched_files)} image file(s) to process")
+        tokenizer: CLIPTokenizer instance
-    
+        texts (list): List of input text strings
-    matched_files.sort(key=lambda x: extract_index(x[0]) if x[2] else 999999)
+        max_len (int): Maximum sequence length
-    
+
-    # Process each image
+    Returns:
-    for filename, filepath, has_pattern in matched_files:
+        np.ndarray: L2-normalized text embeddings with shape (num_texts, embed_dim)
-        if has_pattern:
+    """
-            match = file_pattern.match(filename)
+    embeddings = []
-            if match:
+    for text in texts:
-                name = match.group(1)
+        emb = compute_text_embedding(text_amlnn, tokenizer, text, max_len)
-            else:
+        embeddings.append(emb[0])  # Remove batch dimension
-                name = ""  
+    return np.stack(embeddings, axis=0)  # [num_texts, embed_dim]
-        else:
+
-            name = ""
+# ==================== Similarity Calculation ====================
-        
+
-        # Preprocess image
+def compute_similarity(image_embedding: np.ndarray, text_embeddings: np.ndarray, logit_scale: float = 100.0) -> tuple:
-        try:
+    """
-            input_data = preprocess_image(filepath)
+    Compute similarity between image and text embeddings.
-            input_data = np.expand_dims(input_data, axis=0)
+
-        except Exception as e:
+    Args:
-            print(f"Error preprocessing image {filename}: {e}")
+        image_embedding (np.ndarray): Image embedding with shape (1, embed_dim)
-            continue
+        text_embeddings (np.ndarray): Text embeddings with shape (num_texts, embed_dim)
-        
+        logit_scale (float): Scale factor for logits
-        # Run inference
+
-        try:
+    Returns:
-            outputs = amlnn.inference(inputs=[input_data])
+        tuple: (similarities, logits, probabilities)
-            model_output = outputs[0]  
+    """
-            if isinstance(model_output, np.ndarray):
+    # Cosine similarity (embeddings are already L2-normalized)
-                model_output = model_output.astype(np.float32)
+    sims = text_embeddings @ image_embedding[0]  # [num_texts]
-            else:
+    logits = sims * logit_scale  # [num_texts]
-                model_output = np.array(model_output, dtype=np.float32)
+    probs = softmax(logits, axis=0)  # [num_texts]
-            model_output = model_output.flatten()
+
-        except Exception as e:
+    return sims, logits, probs
-            print(f"Error running inference on {filename}: {e}")
+
-            continue
+# ==================== Main Function ====================
-        
+
-        max_sim = float('-inf')
+def main():
-        best_key = ""
+    parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo using AMLNNLite')
-        best_id = ""
+    parser.add_argument('--vision-model', required=True, help='Path to vision model (.adla)')
-        
+    parser.add_argument('--text-model', required=True, help='Path to text model (.adla)')
-        if not os.path.isdir(base_dir):
+    parser.add_argument('--tokenizer-dir', required=True, help='Path to CLIPTokenizer directory')
-            print(f"Error: Base directory does not exist: {base_dir}")
+    parser.add_argument('--image-path', default=None, help='Path to input image (optional, will prompt if not provided)')
-            continue
+    parser.add_argument('--texts', nargs='+', default=None, help='List of text descriptions to compare')
-        
+    parser.add_argument('--max-len', type=int, default=64, help='Maximum token sequence length (default: 64)')
-        print(f"Searching in base directory: {base_dir}")
+    parser.add_argument('--logit-scale', type=float, default=100.0, help='Logit scale factor (default: 100.0)')
-        folder_count = 0
+
-        for folder_name in os.listdir(base_dir):
+    args = parser.parse_args()
-            folder_path = os.path.join(base_dir, folder_name)
+
-            if not os.path.isdir(folder_path):
+    # Validate model paths
-                continue
+    if not os.path.exists(args.vision_model):
-            
+        print(f"[Error] Vision model not found: {args.vision_model}")
-            if has_pattern and name and name not in folder_name:
+        return -1
-                continue
+
-            
+    if not os.path.exists(args.text_model):
-            folder_count += 1
+        print(f"[Error] Text model not found: {args.text_model}")
-            
+        return -1
-            vit_res_path = os.path.join(folder_path, json_filename)
+
-            if not os.path.isfile(vit_res_path):
+    # Load tokenizer
-                print(f"Warning: JSON file not found: {vit_res_path}")
+    print(f"[Info] Loading CLIPTokenizer from: {args.tokenizer_dir}")
-                continue
+    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_dir)
-            
+
-            try:
+    # Initialize vision model
-                with open(vit_res_path, 'r', encoding='utf-8') as f:
+    print(f"[Info] Initializing vision model: {args.vision_model}")
-                    vit_json = json.load(f)
+    vision_amlnn = AMLNNLite()
-                
+    vision_amlnn.config(model_path=args.vision_model, run_cycles=1)
-                    for key, text_vec in vit_json.items():
+    vision_amlnn.init()
-                        if isinstance(text_vec, list):
+
-                            text_features = np.array(text_vec, dtype=np.float32)
+    # Initialize text model
-                            sim_scaled = post_process(
+    print(f"[Info] Initializing text model: {args.text_model}")
-                                model_output,
+    text_amlnn = AMLNNLite()
-                                text_features,
+    text_amlnn.config(model_path=args.text_model, run_cycles=1)
-                                use_cosine=True,
+    text_amlnn.init()
-                                apply_scale=True,
+
-                            )
+    print("[Info] Models initialized successfully.\n")
-                            
+
-                            if sim_scaled > max_sim:
+    try:
-                                max_sim = sim_scaled
+        # Interactive loop
-                                best_key = key
+        while True:
-                                best_id = folder_name
+            # Get image path
-            except Exception as e:
+            if args.image_path:
-                print(f"Error loading JSON file {vit_res_path}: {e}")
+                image_path = args.image_path
-                continue
+                args.image_path = None  # Clear for next iteration
-        
+            else:
-        if best_key and best_id:
+                print("=" * 60)
-            best_path = os.path.join(base_dir, best_id)
+                print("[Info] Image Path (or 'exit' to quit):")
-            results.append(best_path)
+                image_path = input().strip()
-            print(f"\nProcessing image: {filename}")
+
-            print(f"  Best matching dataset: {best_path}")
+            # Check for exit
-        else:
+            if image_path.lower() == 'exit':
-            print(f"\nProcessing image: {filename}")
+                print("[Info] Exiting...")
-            print(f"  No matching dataset found (searched {folder_count} folder(s))")
+                break
-    
+
-    return results
+            # Validate image path
-
+            if not image_path:
-
+                print("[Warning] Please enter an image path.")
-def main():
+                continue
-    parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo')
+
-    parser.add_argument('--model-path', required=True, help='Path to the CLIP model file')
+            if not os.path.exists(image_path):
-    parser.add_argument('--base-dir', default='./clip_datasets/', help='Base directory for clip datasets (can also use CLIP_BASE_DIR env var)')
+                print(f"[Error] Image not found: {image_path}")
-    parser.add_argument('--json-filename', default='clip_text_res.json', help='JSON filename in each dataset folder (can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)')
+                continue
-    parser.add_argument('--image-dir', default='./', help='Image directory or single image file to process (optional, will prompt if not provided)')
+
-    args = parser.parse_args()
+            # Get texts to compare
-    
+            if args.texts:
-    # Initialize AMLNNLite
+                texts = args.texts
-    print("Initializing model...")
+                args.texts = None  # Clear for next iteration
-    amlnn = AMLNNLite()
+            else:
-    amlnn.config(model_path=args.model_path)
+                print("[Info] Enter text descriptions (comma-separated, or 'skip' to use defaults):")
-    amlnn.init()
+                text_input = input().strip()
-    print("Model initialized successfully.\n")
+
-    
+                if text_input.lower() == 'skip' or not text_input:
-    # Process images
+                    # Default texts for demo
-    if args.image_dir:
+                    texts = [
-        results = process_image_dir(amlnn, args.image_dir, args.base_dir, args.json_filename)
+                        "a red handbag",
-        print(f"\nTotal results: {len(results)}")
+                        "a blue jacket",
-        for i, result in enumerate(results):
+                        "a red bus",
-            print(f"Index[{i}]: {result}")
+                    ]
-    else:
+                    print(f"[Info] Using default texts: {texts}")
-        while True:
+                else:
-            image_path = input("\nPlease enter the JPG image path or directory (enter 'exit' to quit):\n").strip()
+                    texts = [t.strip() for t in text_input.split(',') if t.strip()]
-            
+
-            if image_path.lower() == 'exit':
+            if not texts:
-                break
+                print("[Warning] No texts provided.")
-            
+                continue
-            if not image_path:
+
-                print("The path cannot be empty.")
+            try:
-                continue
+                # Compute image embedding
-            
+                print(f"\n[Info] Processing image: {image_path}")
-            results = process_image_dir(amlnn, image_path, args.base_dir, args.json_filename)
+                image_embedding = compute_image_embedding(vision_amlnn, image_path)
-            
+                print(f"[Info] Image embedding shape: {image_embedding.shape}")
-            for i, result in enumerate(results):
+
-                print(f"Index[{i}]: {result}")
+                # Compute text embeddings
-    
+                print(f"[Info] Processing {len(texts)} text(s)...")
-    amlnn.uninit()
+                text_embeddings = compute_text_embeddings_batch(text_amlnn, tokenizer, texts, args.max_len)
-    print("\nDone.")
+                print(f"[Info] Text embeddings shape: {text_embeddings.shape}")
-
+
-
+                # Compute similarity
-if __name__ == "__main__":
+                sims, logits, probs = compute_similarity(image_embedding, text_embeddings, args.logit_scale)
-    main()
+
                # Print results
                print("\n" + "=" * 60)
                print("CLIP Image-Text Matching Results")
                print("=" * 60)
                print(f"Image: {image_path}")
                print(f"logit_scale: {args.logit_scale:.6f}")
                print("-" * 60)
                # Sort by probability (descending)
                sorted_indices = np.argsort(probs)[::-1]
                for rank, i in enumerate(sorted_indices):
                    print(f"[{rank + 1}] prob={probs[i]:.6f}  sim={float(sims[i]):.6f}  text='{texts[i]}'")
                print("=" * 60 + "\n")
            except Exception as e:
                print(f"[Error] Processing failed: {e}")
                import traceback
                traceback.print_exc()
                continue
    except KeyboardInterrupt:
        print("\n\n[Info] Interrupted by user. Exiting...")
    finally:
        # Cleanup
        vision_amlnn.uninit()
        text_amlnn.uninit()
    print("[Info] Done.")
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
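
Note on `compute_similarity` in the Python demo above: the function relies on two helpers, `l2_normalize` and `softmax`, whose definitions are not visible in this part of the diff. The sketch below shows how the scoring step could behave on synthetic data. The two helper bodies are hypothetical stand-ins written for illustration, not the code from this commit, and the random embeddings replace real AMLNNLite outputs.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    """Hypothetical stand-in: scale vectors along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Hypothetical stand-in: numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def compute_similarity(image_embedding, text_embeddings, logit_scale=100.0):
    """Same three steps as the demo: cosine similarity, scaling, softmax."""
    sims = text_embeddings @ image_embedding[0]  # [num_texts]
    logits = sims * logit_scale                  # [num_texts]
    probs = softmax(logits, axis=0)              # [num_texts]
    return sims, logits, probs


# Synthetic embeddings standing in for model outputs (embed_dim = 8 is arbitrary).
rng = np.random.default_rng(0)
image_embedding = l2_normalize(rng.normal(size=(1, 8)), axis=1)
text_embeddings = l2_normalize(rng.normal(size=(3, 8)), axis=1)

# Force the second text to point in the same direction as the image,
# so its cosine similarity is 1 and it should rank first.
text_embeddings[1] = image_embedding[0]

sims, logits, probs = compute_similarity(image_embedding, text_embeddings)
best = int(np.argmax(probs))
print(f"best index: {best}, sims: {sims}, probs: {probs}")
```

Because both inputs are unit-length, the matrix product yields cosine similarities directly. With the default `logit_scale` of 100.0, small differences in similarity turn into sharply peaked probabilities, which is why the demo prints both `prob` and `sim` for each text.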
--- a/examples/clip/tokenizer_path/merges.txt
+++ b/examples/clip/tokenizer_path/merges.txt
--- a/examples/clip/tokenizer_path/vocab.json
+++ b/examples/clip/tokenizer_path/vocab.json