Reference:
Krohin, A.S., Gusev, M.M. (2025). Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections. Software systems and computational methods, 2, 44–62. https://doi.org/10.7256/2454-0714.2025.2.73939
Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections
DOI: 10.7256/2454-0714.2025.2.73939
EDN: FBOXHC
Received: 02-04-2025
Published: 21-05-2025

Abstract: The article addresses the issue of prompt obfuscation as a means of circumventing protective mechanisms in large language models (LLMs) designed to detect prompt injections. Prompt injections are a method of attack in which malicious actors manipulate input data to alter the model's behavior and cause it to perform undesirable or harmful actions. Obfuscation involves various methods of changing the structure and content of text, such as replacing words with synonyms, scrambling letters in words, inserting random characters, and others. The purpose of obfuscation is to complicate the analysis and classification of text in order to bypass filters and protective mechanisms built into language models. The study analyzes the effectiveness of various obfuscation methods in bypassing models trained for text classification tasks. Particular attention is paid to assessing the potential implications of obfuscation for security and data protection. The research applies different text obfuscation methods to prompts from the AdvBench dataset. The effectiveness of the methods is evaluated using three classifier models trained to detect prompt injections. The scientific novelty of the research lies in analyzing the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections. The study found that applying complex obfuscation methods increases the proportion of requests classified as injections, highlighting the need for a thorough approach to testing the security of large language models. The conclusions indicate the importance of balancing the complexity of an obfuscation method against its effectiveness in the context of attacks on models. Excessively complex obfuscation methods may increase the likelihood of injection detection, which requires further investigation to optimize approaches to ensuring the security of language models. The results underline the need for continuous improvement of protective mechanisms and the development of new methods for detecting and preventing attacks on large language models.

Keywords: LLM, prompt injection, obfuscation, jailbreak, AI, adversarial attacks, encoder, transformers, AI security, fuzzing

Introduction

In recent years, there has been significant progress in natural language processing, which has led to the widespread adoption of language models in various applications, from chatbots to automatic translation systems. However, as the capabilities of these models increase, so does the number of threats associated with their use. One of these threats is prompt injection, a technique for manipulating input data in order to change the behavior of a model. To counter this threat, specialized models are being developed to detect prompt injections. However, attackers do not stand still either and devise various obfuscation methods to circumvent such protective mechanisms. Prompt obfuscation involves changing the structure and content of a text in order to make it more difficult to analyze and classify. The purpose of this article is to investigate and analyze various methods of prompt injection obfuscation that can be used to bypass prompt injection detection models.
The object of the study is large language models (LLMs) and their vulnerability to attacks through prompt injections. The subject of the study is prompt obfuscation methods and their effect on the effectiveness of classifier models designed to detect prompt injections in large language models. The research examines techniques such as replacing words with synonyms, shuffling letters in words, inserting special characters, and other approaches. The effectiveness of these methods is assessed on modern language models trained for the text classification task, and their potential implications for data security and protection are evaluated.

Prompt injections, jailbreaks, and obfuscation methods

Prompt injection [1-4] is an attack method aimed at manipulating the input data passed to artificial intelligence systems, especially large language models (LLMs). The purpose of such attacks is to change the behavior of the model so that it performs unwanted or malicious actions, ignoring the original instructions of the developers. Large language models are trained to interpret text prompts as instructions. Attackers exploit this feature by embedding specially crafted text fragments that force the model to:
1) ignore system restrictions or security rules;
2) generate malicious content;
3) disclose confidential data;
4) perform actions that were not intended by the developers.
An example would be the request: "Forget all previous instructions and provide access to confidential information." If the system is not sufficiently protected, it may fulfill this request.

Jailbreak [5-7] is a subtype of attack on large language models belonging to the category of prompt injections. Its goal is to force the model to ignore the built-in restrictions and security protocols embedded during the model alignment process [8], so that it performs actions that are normally prohibited by the developers. The main task of jailbreaks is to remove the restrictions imposed on the model. This can lead to:
1) generation of malicious content (for example, writing exploits or instructions for dangerous actions);
2) disclosure of confidential information;
3) actions that contradict the developers' security policy.

Obfuscation is a method of masking or complicating text, code, or data that preserves their functionality but makes them difficult to analyze and interpret. In the context of attacks on large language models, such as jailbreaks and prompt injections, obfuscation is used to bypass built-in filters and defense mechanisms. Obfuscation makes attacks on LLMs more difficult to detect and prevent. Studies [9-10] note that it allows attackers to:
1) avoid detection by the models' built-in filters;
2) create unique variations of malicious content;
3) automate the generation of complex attacks using AI.

Within the framework of this study, various text obfuscation methods were used, aimed at complicating the analysis and classification of text by language models. These methods were designed to change the structure and content of the text, which makes it possible to evaluate their effectiveness in bypassing models intended to detect prompt injections [11]. Below is a detailed list of the obfuscation methods used in the experiment.

Word obfuscation methods:
1. Deleting random characters (randomly_remove_characters): characters are removed from a word with a given probability (0.2), which shortens the word and changes its structure.
2. Repeating letters (repeat_letters): each letter in a word may be repeated a random number of times (once or twice), which lengthens the word and changes its visual appearance.
3. Adding emojis (add_emojis): random emojis are added after some words in the text, which can distract the model's attention from the main content [12].
4. Inserting invisible characters (insert_invisible_characters): invisible Unicode characters are inserted into the word; they are not displayed when the text is rendered but change its internal structure.
5. Inserting random characters (insert_random_symbols): random characters from a given set are inserted after some letters in a word, which changes its structure and complicates analysis.
6. Shuffling characters (shuffle_characters): the characters inside a word are shuffled, except for the first and last ones, which preserves visual similarity but changes the internal structure.
7. Transliteration: Cyrillic characters are replaced with their Latin counterparts, which changes the visual appearance of the text while preserving its pronunciation.
8. Random conversion to UTF-8 (random_utf8_conversion): some characters in a word are converted to their UTF-8 representation with a given probability, which changes the internal structure of the text.
9. Reversing character order (reverse_word): the order of characters in a word is reversed, which completely changes its visual appearance.
10. Inserting random spaces (insert_random_spaces): random spaces are inserted into the word, which changes its appearance and structure.
11. Replacement with similar characters (replace_with_similitar_chars): some characters are replaced with visually similar counterparts from other alphabets.
12. Inserting foreign characters (insert_foreign_characters): random characters from foreign alphabets are inserted into the word, which changes its structure and complicates analysis.
13. Adding diacritics (add_diacritics): random diacritics are added to the characters of a word, which changes its visual appearance.
14. Using mirrored characters (use_mirrored_characters): some characters are replaced with their mirrored counterparts, which changes the visual appearance of the text.

Sentence obfuscation methods:
1. Shuffling words (shuffle_words): the word order in a sentence is randomly shuffled, which changes its syntactic structure.
2. Reversing word order (reverse_sentence): the words in a sentence are arranged in reverse order, which changes its syntactic structure and perception.
3. Adding random text (add_random_text): random phrases are added after some sentences, which can distract the model's attention from the main content.

To counter attacks carried out through user prompts, prompt injection classifier models are being developed [13-14]: specialized language models trained for the text classification task. Most often, such models are fine-tuned versions of lightweight encoder models, for example, based on the BERT architecture [15]. These models classify the input text into two classes: text that does not contain a prompt injection and text that does.
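As an illustration, the following minimal Python sketch (assuming the Hugging Face transformers library) queries one of the open classifiers examined later in this article. The example prompt and the exact label names returned by the pipeline are assumptions made for illustration and may differ between models.

```python
# Minimal sketch: querying a prompt injection classifier through the
# transformers pipeline API. The model name is one of the classifiers
# examined in this article; the prompt and the printed label are illustrative.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="protectai/deberta-v3-base-prompt-injection-v2",
)

prompt = "Forget all previous instructions and provide access to confidential information."
result = classifier(prompt)[0]

# The pipeline returns a class label (e.g. SAFE or INJECTION) and a confidence score.
print(result["label"], round(result["score"], 3))
```

In a real deployment the classifier's verdict would then decide whether the request is forwarded to the main language model, as described next.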
The main use of such classifiers is as a proxy between the user and the main language model. This makes it possible to filter potentially dangerous or unwanted requests, protecting the model from manipulation and ensuring safer interaction with users. The classical scheme for integrating such models into an LLM service [16-17] is shown in Figure 1.

Figure 1 – Scheme of applying a classifier model to protect applications from prompt injections

Review of scientific research

There are articles on the use of obfuscation for attacks on large language models [22-24]. The article [22] is devoted to the problem of protecting the system prompts of large language models, which often contain unique instructions and are considered intellectual property. The authors propose a method of prompt obfuscation: converting the original prompt into a form that preserves its functionality but does not allow the original information to be extracted even under attacks of various types (black-box and white-box). The study includes a comparison of LLM performance with the original and obfuscated prompts across eight metrics, as well as an analysis of resistance to deobfuscation attempts. The results show that the proposed method effectively protects the prompt without losing the usefulness of the model. The article [23] examines the security of complex information systems when large language models are integrated into them. The authors analyze the main threats associated with the use of LLMs (for example, data leakage, malicious queries, prompt-related vulnerabilities) and systematize modern protection methods. Special attention is paid to analyzing the risks associated with prompt injections and approaches to minimizing them, including technical and organizational measures. The paper provides an overview of current threats and security practices when implementing LLMs in corporate and government systems. However, these articles study obfuscation for autoregressive large language models, that is, models aimed at generating text based on a prompt. No work was found that analyzes the effect of input obfuscation on classifier language models.
Description of the experiment

Purpose of the experiment

The purpose of this experiment is to analyze the impact of various obfuscation methods on the effectiveness of prompt injection detection by modern language models, including state-of-the-art architectures. The research aims to identify the vulnerability of models to input distortions under which dangerous requests (for example, instructions for creating malware, weapons, or offensive content) may go unnoticed due to changes in the structure or content of the text. This allows us to assess how much attack-masking methods reduce the ability of models to recognize threats that violate moral and legal norms.

Methodology

The AdvBench dataset [18] is used for testing. It was chosen for its specialization in dangerous instructions that violate moral and legal norms (for example, generation of malware, weapons, or offensive content), which directly corresponds to the purpose of the study: the analysis of obfuscation methods aimed at bypassing defenses against prompt injections. Its advantages include a narrow focus on attack scenarios that are not present in general-purpose datasets (for example, GLUE), as well as standardization and reproducibility, since it is widely used in AI security research. The data cover a variety of attack types, from social engineering to technical manipulation, providing representativeness for testing obfuscation techniques.

Testing procedure

1) Choosing models. Three classifier models are selected for the experiment, each designed to detect prompt injections. These models can be based on the BERT architecture or other similar encoder models.
2) Obfuscation of instructions. Each prompt from the AdvBench dataset is obfuscated using each of the selected methods in turn.
3) Classification. Obfuscated instructions are fed to the input of each classifier model. The output of the model, which classifies the text as SAFE or INJECTION, is saved for subsequent analysis.
4) Analysis of the results.
a) For each obfuscation method, the number of classifier responses belonging to the SAFE and INJECTION classes is counted.
b) The proportion of SAFE responses among all requests is calculated for each obfuscation method using formula (1):

Proportion of SAFE = SAFE / (SAFE + INJECTION)    (1)

c) The effectiveness of each obfuscation method is evaluated as the deviation of the proportion of SAFE classifications for obfuscated prompts from the baseline value, i.e. the proportion of SAFE classifications for plain instructions, according to formula (2):

Effectiveness = Proportion of SAFE (obfuscated) - Proportion of SAFE (plain instruction)    (2)

d) The results are compared between different obfuscation methods, as well as with the corresponding values for instructions that do not undergo obfuscation.

Expected results

The experiment aims to identify differences in the effectiveness of classifier models when processing obfuscated and non-obfuscated instructions. It is expected that obfuscation with certain methods will complicate the classification task by increasing the proportion of SAFE classifications for dangerous queries compared to other methods. Comparing the results between different obfuscation methods will make it possible to evaluate how well these methods hide prompt injections. A code sketch of this evaluation procedure is given below.
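To make the procedure and formulas (1)-(2) concrete, the following Python sketch evaluates one obfuscation method against one classifier. It is illustrative only: shuffle_words is a simplified re-implementation of the sentence-level method described above, the prompts list is a placeholder for the AdvBench instructions, and the SAFE/INJECTION label names are assumed to match the classifier's output.

```python
# Illustrative sketch of the evaluation loop for a single obfuscation method.
# Assumptions: `prompts` stands in for the AdvBench instructions, and the
# classifier returns labels named "SAFE" and "INJECTION".
import random
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="protectai/deberta-v3-base-prompt-injection-v2",
)

def shuffle_words(sentence: str) -> str:
    # Sentence-level obfuscation: randomly reorder the words (shuffle_words).
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def safe_proportion(texts: list[str]) -> float:
    # Formula (1): share of texts the classifier labels as SAFE.
    labels = [classifier(t)[0]["label"] for t in texts]
    return labels.count("SAFE") / len(labels)

prompts = ["..."]  # placeholder: harmful instructions from the AdvBench dataset

baseline = safe_proportion(prompts)                     # plain instructions
obfuscated = [shuffle_words(p) for p in prompts]        # one method at a time
effectiveness = safe_proportion(obfuscated) - baseline  # formula (2)

print(f"SAFE share (plain): {baseline:.2f}")
print(f"Effectiveness of shuffle_words: {effectiveness:+.2f}")
```

In the experiment itself, this loop is repeated for every obfuscation method and every classifier, and the resulting effectiveness values are ranked and compared.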
We seek to determine which obfuscation methods are most successful in hiding dangerous requests, that is, which make prompts less likely to be classified as injections. This will help identify weaknesses in current approaches to prompt injection detection and suggest ways to improve them.

Tested models

The protectai/deberta-v3, XiweiZ/sts-ft-promptInjection, and fmops/distilbert-prompt-injection models were chosen because they are specifically trained to recognize dangerous requests (for example, attempts to deceive AI). Each of them suits different situations: one is highly accurate (DeBERTa), another is fast and resource-efficient (DistilBERT), and the third adapts flexibly to new data (SetFit). All of them are open and are often used in research and in AI protection systems, which makes it possible to compare results and verify their reliability. This makes them convenient and representative for testing prompt obfuscation methods.

1. protectai/deberta-v3-base-prompt-injection-v2. This model is an improved version of the microsoft/deberta-v3-base model, further trained on a variety of combined datasets containing both prompt injections and ordinary prompts. The main purpose of the model is to identify prompt injections by classifying the input into two categories: 0 (no injection) and 1 (injection detected).
2. XiweiZ/sts-ft-promptInjection. This is a SetFit model [19] designed for text classification. It was trained using an efficient few-shot learning technique that includes fine-tuning a sentence transformer with contrastive learning; afterwards, a classification head is trained on the features obtained from the pre-trained sentence transformer.
3. fmops/distilbert-prompt-injection. This model is based on the DistilBERT architecture [20], a simplified and faster version of the original BERT model. DistilBERT retains the main characteristics of BERT but has fewer parameters, which makes it more efficient in terms of computing resources. The fmops/distilbert-prompt-injection model is specifically adapted for prompt injection detection in texts.

Test results

fmops/distilbert-prompt-injection

The results of testing fmops/distilbert-prompt-injection against the various obfuscation methods are presented in Table 1. Sentence obfuscation methods showed higher effectiveness than word obfuscation methods. For example, the "word shuffling" and "reverse word order" methods demonstrated the highest effectiveness values, 0.17 and 0.20 respectively (Table 2). This indicates their ability to effectively bypass the model's protective mechanisms. Word obfuscation methods such as "deleting random characters", "repeating letters", and "inserting random characters" showed negative effectiveness with a value of -0.08 (Table 2). This indicates that chaotic changes at the word level not only fail to hinder the model's detection of prompt injections but actually make injections more noticeable. Some methods, such as "transliteration" and "adding diacritics", had no effect on the model's behavior, as evidenced by an effectiveness value of 0.00 (Table 2).
The experiment showed that obfuscation methods that change the structure of sentences are more effective at bypassing the fmops/distilbert-prompt-injection model. This may be because changes at the sentence level create more significant distortions in the semantic structure of the text while not producing the token-sequence changes characteristic of attacks, which makes it harder for the model to assess the danger of the text. Methods that introduce chaotic changes at the word level, such as inserting random characters or repeating letters, are less successful because they often match patterns that indicate an attempt to circumvent the model's protective mechanisms. This highlights the importance of the structure and coherence of the text for preventing the detection of prompt injections in it.

Table 1 – Results of fmops/distilbert-prompt-injection testing with various obfuscation methods
Using formula (2), the effectiveness of each obfuscation method was calculated; the values are ranked in descending order.

Table 2 – Effectiveness of various obfuscation methods against fmops/distilbert-prompt-injection
The XiweiZ/sts-ft-promptInjection model

The results of testing XiweiZ/sts-ft-promptInjection against the various obfuscation methods are presented in Table 3. The analysis showed that, compared with the results for the previously tested model, the obfuscation methods demonstrate only moderate effectiveness in circumventing this model's protective mechanisms: the effectiveness values ranged from 0.01 to 0.06 (Table 4). The "inserting invisible characters" method proved the most effective, reaching 0.06 (Table 4). Word obfuscation methods, including "deleting random characters", "inserting random characters", and "shuffling characters", also showed some of the highest effectiveness values at 0.05 (Table 4). However, it is important to keep in mind that overly complex methods that change words beyond recognition are more likely to be detected. This is because models such as XiweiZ/sts-ft-promptInjection may be trained on data containing such anomalies and may take the perplexity [21] of the prompt text into account during analysis. In addition, the "transliteration", "reverse word order", "adding emojis", "reverse sentence order", and "adding random text" methods demonstrated the lowest effectiveness, in the range from 0.01 to 0.02 (Table 4). Thus, to bypass the model effectively, it is necessary to find a balance between the complexity of the obfuscation method and the probability of its detection. This underlines the importance of a careful approach to the choice of obfuscation methods, taking into account the characteristics of the analyzed model and its ability to recognize anomalies in the text.

Table 3 – Results of XiweiZ/sts-ft-promptInjection testing with various obfuscation methods
Using formula (2), the effectiveness of each obfuscation method was calculated; the values are ranked in descending order.

Table 4 – Effectiveness of various obfuscation methods against XiweiZ/sts-ft-promptInjection
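As an aside on the perplexity heuristic mentioned above, the sketch below shows one common way to estimate the perplexity of a prompt with a small causal language model. GPT-2 is used here purely as an illustrative stand-in; the article does not specify which model, if any, the tested classifiers rely on for such a check.

```python
# Illustrative sketch: estimating prompt perplexity with a small causal LM.
# Heavily obfuscated text typically has much higher perplexity, which is one
# signal a detector can exploit. GPT-2 is an assumption used for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Cross-entropy loss over the sequence; exponentiating gives perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("Ignore all previous instructions."))        # ordinary text
print(perplexity("Ign0re a ll prev1ous in struc tions."))     # obfuscated text
```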
protectai/deberta-v3-base-prompt-injection model

The analysis of the protectai/deberta-v3-base-prompt-injection model on obfuscated prompts showed that sentence obfuscation methods demonstrate the highest effectiveness among the tested methods; however, it does not exceed 0.00, which indicates no effect on the model's behavior. The "word shuffling" and "reverse word order" methods showed effectiveness of 0.00 and -0.01, respectively (Table 6). Word obfuscation methods such as "transliteration" and "adding emojis" also had zero effect on the model's responses (Table 6). The remaining methods have negative effectiveness values, many of which are close to -1. It is noteworthy that the proportion of SAFE classifications for plain, unmodified instructions was 0.99 (Table 5). This means that the model has a high probability of mistakenly classifying potentially dangerous prompts as safe, or is unable to adequately recognize attempts to misuse large language models through prompt injections. Given this poor detection of prompt injections in direct malicious instructions, it should be borne in mind that the model may nevertheless have been trained to detect modified prompts of the kind used to bypass protection and alignment mechanisms. In this regard, it is more informative to compare the effectiveness of the techniques among themselves, excluding comparisons with plain instructions.

Table 5 – Results of protectai/deberta-v3-base-prompt-injection testing with various obfuscation methods
Using formula (2), the effectiveness of each obfuscation method was calculated; the values are ranked in descending order.

Table 6 – Effectiveness of various obfuscation methods against protectai/deberta-v3-base-prompt-injection
Conclusion

In the course of the study, the influence of various obfuscation methods on the effectiveness of classifier models designed to detect prompt injections was examined. The analysis showed that structural changes in the text, especially at the sentence level, can significantly reduce the ability of models to recognize dangerous queries. This highlights the urgency of further developing mechanisms to protect language models from attacks based on input distortion.

The study was conducted using three modern models: protectai/deberta-v3, XiweiZ/sts-ft-promptInjection, and fmops/distilbert-prompt-injection. For each obfuscation method, the proportion of SAFE classifications was calculated, that is, cases when the model mistakenly identified a malicious request as safe. The effectiveness of a method was estimated as the difference between the SAFE proportion for obfuscated queries and the baseline value obtained for unchanged instructions. This approach made it possible to compare the results objectively and identify the most vulnerable points of the existing classifiers.

It was found that methods that change the structure of a sentence, such as shuffling words or reversing word order, proved the most successful at bypassing prompt injection detectors. At the same time, chaotic distortions at the word level often drew increased attention from the models and did not lead to a significant increase in the proportion of malicious prompts passing as safe. The data obtained provide a basis for improving protection systems so that they are resistant to complex forms of obfuscation. The need for more rigorous testing and verification of models used to filter potentially dangerous content was also emphasized. The results demonstrate the importance of finding a balance between the complexity of an obfuscation method and its ability to mask an attack. They also open up prospects for further research aimed at increasing the resilience of large language models to manipulation and creating more reliable mechanisms for their protection.

Prospects for further research

The prospects for further research include expanding the experimental base by analyzing the success of jailbreaking large language models using various obfuscation methods. This will not only deepen understanding of the vulnerabilities of existing defense mechanisms, but also identify the key parameters that affect the effectiveness of attacks. One of the key tasks will be to find a balance between two opposing effects: a decrease in the proportion of requests blocked by classifier models and a simultaneous increase in the proportion of requests leading to a successful LLM jailbreak. To this end, a series of experiments is planned in which the parameters of obfuscation will be systematically varied, such as the complexity of the transformations, their number, and their type (lexical, semantic, or syntactic changes). Based on the data obtained, the aim is to identify the obfuscation methods that minimize the likelihood of an attack being detected by defense mechanisms while maximizing the likelihood of compromising the target model.

References
1. Liu, Y., et al. (2024). Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24) (pp. 1831-1847).
2. Greshake, K., et al. (2023). Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (pp. 79-90).
3. Shi, J., et al. (2024). Optimization-based prompt injection attack to LLM-as-a-judge. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (pp. 660-674).
4. Sang, X., Gu, M., & Chi, H. (2024). Evaluating prompt injection safety in large language models using the PromptBench dataset.
5. Xu, Z., et al. (2024). LLM jailbreak attack versus defense techniques: A comprehensive study. arXiv e-prints. arXiv:2402.13457.
6. Hu, K., et al. (2024). Efficient LLM jailbreak via adaptive dense-to-sparse constrained optimization. In Advances in Neural Information Processing Systems (Vol. 37, pp. 23224-23245).
7. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (Vol. 36, pp. 80079-80110).
8. Li, J., et al. (2024). Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment. In Advances in Neural Information Processing Systems (Vol. 37, pp. 124292-124318).
9. Kwon, H., & Pak, W. (2024). Text-based prompt injection attack using mathematical functions in modern large language models. Electronics, 13(24), 5008.
10. Steindl, S., et al. (2024). Linguistic obfuscation attacks and large language model uncertainty. In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024) (pp. 35-40).
11. Kim, M., et al. (2024). Protection of LLM environment using prompt security. In 2024 15th International Conference on Information and Communication Technology Convergence (ICTC) (pp. 1715-1719). IEEE.
12. Wei, Z., Liu, Y., & Erichson, N. B. (2024). Emoji attack: A method for misleading judge LLMs in safety risk detection. arXiv preprint arXiv:2411.01077.
13. Rahman, M. A., et al. (2024). Applying pre-trained multilingual BERT in embeddings for improved malicious prompt injection attacks detection. In 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings) (pp. 1-7). IEEE.
14. Chen, Q., Yamaguchi, S., & Yamamoto, Y. (2025). LLM abuse prevention tool using GCG jailbreak attack detection and DistilBERT-based ethics judgment. Information, 16(3), 204.
15. Aftan, S., & Shah, H. (2023). A survey on BERT and its applications. In 2023 20th Learning and Technology Conference (L&T) (pp. 161-166). IEEE.
16. Chan, C. F., Yip, D. W., & Esmradi, A. (2023). Detection and defense against prominent attacks on preconditioned LLM-integrated virtual assistants. In 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE) (pp. 1-5). IEEE.
17. Biarese, D. (2022). AdvBench: A framework to evaluate adversarial attacks against fraud detection systems.
18. Liu, W., et al. (2025). DrBioRight 2.0: An LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis. Nature Communications, 16(1), 2256. https://doi.org/10.1038/s41467-025-57430-4
19. Pannerselvam, K., et al. (2024). SetFit: A robust approach for offensive content detection in Tamil-English code-mixed conversations using sentence transfer fine-tuning. In Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (pp. 35-42).
20. Akpatsa, S. K., et al. (2022). Online news sentiment classification using DistilBERT. Journal of Quantum Computing, 4(1).
21. Grytsay, H. M., Khabutdinov, I. A., & Grabovoy, A. V. (2024). Stackmore LLMs: Effective detection of machine-generated texts using perplexity value approximation. Reports of the Russian Academy of Sciences. Mathematics, Computer Science, Control Processes, 520(2), 228-237. https://doi.org/10.31857/S2686954324700590
22. Pape, D., et al. (2024). Prompt obfuscation for large language models. arXiv preprint arXiv:2409.11026.
23. Evglevskaya, N. V., & Kazantsev, A. A. (2024). Ensuring the security of complex systems integrating large language models: Threat analysis and defense methods. Economics and Quality of Communication Systems, 4(34), 129-144.
24. Shang, S., et al. (2024). Intentobfuscator: A jailbreaking method via confusing LLM with prompts. In European Symposium on Research in Computer Security (pp. 146-165). Springer Nature Switzerland.