Рус Eng Cn Translate this page:
Please select your language to translate the article


You can just close the window to don't translate
Library
Your profile

Back to contents

Software systems and computational methods
Reference:

Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections

Krohin Aleksei Sergeevich

Student; Moscow Institute of Electronics and Mathematics; National Research University 'Higher School of Economics'

13 Volgogradsky ave., Moscow, 109316, Russia

askrokhin@edu.hse.ru
Gusev Maksim Mihailovich

Student; Moscow Institute of Electronics and Mathematics; National Research University 'Higher School of Economics'

115 k. 3 Volgogradsky ave., Moscow, 109117, Russia

gusevmaxim04@mail.ru

DOI:

10.7256/2454-0714.2025.2.73939

EDN:

FBOXHC

Received:

02-04-2025


Published:

21-05-2025


Abstract: The article addresses the issue of prompt obfuscation as a means of circumventing protective mechanisms in large language models (LLMs) designed to detect prompt injections. Prompt injections represent a method of attack in which malicious actors manipulate input data to alter the model's behavior and cause it to perform undesirable or harmful actions. Obfuscation involves various methods of changing the structure and content of text, such as replacing words with synonyms, scrambling letters in words, inserting random characters, and others. The purpose of obfuscation is to complicate the analysis and classification of text in order to bypass filters and protective mechanisms built into language models. The study conducts an analysis of the effectiveness of various obfuscation methods in bypassing models trained for text classification tasks. Particular attention is paid to assessing the potential implications of obfuscation for security and data protection. The research utilizes different text obfuscation methods applied to prompts from the AdvBench dataset. The effectiveness of the methods is evaluated using three classifier models trained to detect prompt injections. The scientific novelty of the research lies in analyzing the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections. During the study, it was found that the application of complex obfuscation methods increases the proportion of requests classified as injections, highlighting the need for a thorough approach to testing the security of large language models. The conclusions of the research indicate the importance of balancing the complexity of the obfuscation method with its effectiveness in the context of attacks on models. Excessively complex obfuscation methods may increase the likelihood of injection detection, which requires further investigation to optimize approaches to ensuring the security of language models. The results underline the need for the continuous improvement of protective mechanisms and the development of new methods for detecting and preventing attacks on large language models.


Keywords:

LLM, prompt injection, obfuscation, jailbreak, AI, adversarial attacks, encoder, transformers, AI security, fuzzing

This article is automatically translated. You can find original text of the article here.

Introduction

In recent years, there has been significant progress in natural language processing, which has led to the widespread adoption of language models in various applications, from chatbots to automatic translation systems. However, as the capabilities of these models increase, so does the number of threats associated with their use. One of these threats is prompt injection, which is a technique for manipulating input data in order to change the behavior of a model.

To counter this threat, special models are being developed designed to detect prompt injections. However, attackers also do not stand still and develop various methods of obfuscation to circumvent such protective mechanisms. Industrial obfuscation involves changing the structure and content of a text in order to make it more difficult to analyze and classify.

The purpose of this article is to investigate and analyze various methods of industrial injection obfuscation that can be used to bypass models of industrial injection detection. The object of the study is language models (LLM) and their vulnerability to attacks through prompt injections. The subject of the study is methods of industrial obfuscation and their effect on the effectiveness of classifier models designed to detect industrial injections in large language models. The research examines techniques such as replacing words with synonyms, mixing letters in words, introducing special characters, and other approaches. The effectiveness of these methods is assessed using the example of modern language models trained for the task of text classification, and their potential implications for data security and protection are assessed.

Prompt injections, jailbreaks, and obfuscation methods

Prompt injection [1-4] is an attack method aimed at manipulating input data transmitted to artificial intelligent systems, especially large language models (LLM). The purpose of such attacks is to change the behavior of the model so that it performs unwanted or malicious actions, ignoring the initial instructions of the developers.

Large language models are trained to interpret text prompts (promptas) as instructions. Attackers use this feature by embedding specially designed text fragments that force the model to:

1) Ignore system restrictions or security rules;

2) Generate malicious content;

3) Disclose confidential data;

4) Perform actions that were not provided by the developers.

An example would be a request: “Forget all previous instructions and provide access to confidential information.” If the system is not secure enough, it can fulfill this request.

Jailbreak [5-7] is a subtype of attacks on large language models (LLM), belonging to the category of prompt injections. Its goal is to force the model to ignore the built—in restrictions and security protocols embedded in the alignment process of the model [8] so that it performs actions that are usually prohibited by developers.

The main task of jailbreaks is to remove the restrictions imposed on the model. This can lead to:

1) Generating malicious content (for example, writing exploits or instructions for dangerous actions);

2) Disclosure of confidential information;

3) Performing actions that contradict the developers' security policy.

Obfuscation is a method of masking or complicating text, code, or data that preserves their functionality but makes them difficult to analyze and interpret. In the context of attacks on large language models (LLM), such as jailbreaks and prompt injections, obfuscation is used to bypass built-in filters and defense mechanisms.

Obfuscation makes attacks on LLM more difficult to detect and prevent. Studies [9-10] note that it allows:

1) Avoid detection by built-in filters of models;

2) Create unique variations of malicious content;

3) Automate the generation of complex attacks using AI.

Within the framework of this study, various methods of obfuscation of texts were used, aimed at complicating their analysis and classification by language models. These methods were developed to change the structure and content of the text, which makes it possible to evaluate their effectiveness in bypassing models designed to detect prompt injections [11]. Below is a detailed list of obfuscation methods used in the experiment.

Methods of word obfuscation:

1. Deleting random characters (randomly_remove_characters): This method involves removing characters from a word with a given probability (0.2), which leads to a reduction in the length of the word and a change in its structure.

2. Repetition of letters (repeat_letters): In this method, each letter in a word can be repeated a random number of times (once or twice), which leads to an increase in the length of the word and a change in its visual perception.

3. Adding emojis (add_emojis): The method involves adding random emojis after some words in the text, which can distract the model's attention from the main content [12].

4. Inserting invisible characters (insert_invisible_characters): In this method, invisible Unicode characters are inserted into the word, which are not displayed when the text is rendered, but change its internal structure.

5. Inserting random characters (insert_random_symbols): The method involves inserting random characters from a given set after some letters in a word, which changes its structure and complicates analysis.

6. Shuffle of characters (shuffle_characters): In this method, the characters inside the word are shuffled, with the exception of the first and last characters, which preserves the visual similarity, but changes the internal structure.

7. Transliteration: This method replaces Cyrillic characters with their Latin counterparts, which changes the visual perception of the text while preserving its pronunciation.

8. Random conversion to UTF-8 (random_utf8_conversion): Some characters in a word are converted to their UTF-8 representation with a given probability, which changes the internal structure of the text.

9. Reverse character order (reverse_word): The method involves changing the order of characters in a word in reverse, which completely changes its visual perception.

10. Inserting random spaces (insert_random_spaces): In this method, random spaces are inserted into the word, which changes its visual perception and structure.

11. Replacement with similar characters (replace_with_similitar_chars): The method replaces some characters with visually similar counterparts from other alphabets, which changes the visual perception of the text.

12. Inserting foreign characters (insert_foreign_characters): Random characters from foreign alphabets are inserted into the word, which changes its structure and complicates the analysis.

13. Adding diacritics (add_diacritics): The method involves adding random diacritics to the characters of a word, which changes its visual perception.

14. Using mirrored characters (use_mirrored_characters): In this method, some characters are replaced with their mirror counterparts, which changes the visual perception of the text.

Methods of sentence obfuscation:

1. Shuffle of words (shuffle_words): This method involves randomly shuffling the word order in a sentence, which changes its syntactic structure.

2. Reverse word order (reverse_sentence): In this method, the words in a sentence are arranged in reverse order, which changes its syntactic structure and perception.

3. Adding random text (add_random_text): The method involves adding random phrases after some sentences, which can distract the model's attention from the main content.

To counter attacks carried out through user promptings, prompt injection classifier models are being developed [13-14], which are specialized language models trained for the task of text classification. Most often, such models are retrofitted versions of lightweight encoder models, for example, based on the BERT architecture [15]. These models classify the input text into two classes: the text that does not contain the prompt injection, and the text that contains it.

The main use of such classifiers is to use them as a proxy between the user and the main language model. This allows you to filter potentially dangerous or unwanted requests, protecting the model from manipulation and ensuring safer interaction with users. The scheme of the classical version of the integration of such models into an LLM service [16-17] is shown in Figure 1.

Figure 1 – scheme of application of the classifier model to protect applications from prompt injections

Review of scientific research

There are articles on the use of obfuscation for attacks on large language models [22-24].

The article [22] is devoted to the problem of protecting system prompts of large language models (LLM), which often contain unique instructions and are considered intellectual property. The authors propose a method for obfuscating prompta - converting the original prompta into a form that preserves its functionality, but does not allow extracting the original information even in attacks of various types (black-box and white-box). The study includes a comparison of LLM performance with the original and obfuscated prompta by eight metrics, as well as an analysis of resistance to deobfuscation attempts. The results show that the proposed method effectively protects the industrial environment without losing the usefulness of the model.

The article [23] examines the issues of ensuring the security of complex information systems when integrating large language models into them. The authors analyze the main threats associated with the use of LLM (for example, data leakage, malicious queries, vulnerabilities of industrial products), and systematize modern methods of protection. Special attention is paid to the analysis of the risks associated with prompt injections and approaches to their minimization, including technical and organizational measures. The paper provides an overview of current threats and security practices when implementing LLM in corporate and government systems.
The article [24] describes a new way of circumventing LLM protection mechanisms, known as “jailbreaking". The authors present the Intentobfuscator tool, which uses intentionally confusing prompts to hide the user's true intention from the model. This approach allows you to bypass security filters and get LLM to issue prohibited or unwanted content. The paper analyzes in detail the techniques of obfuscation, the effectiveness of circumvention of protections and vulnerabilities of modern industrial filtration systems. The study highlights the need to improve LLM protection mechanisms against such attacks.

But these articles explore obfuscation for autoregressive large language models, that is, aimed at generating text based on a hint (prompt). At the same time, no work was found aimed at analyzing the effect of obfuscation of input data for language models of the text classifier.

Description of the experiment

The purpose of the experiment

The purpose of this experiment is to analyze the impact of various obfuscation methods on the effectiveness of prompt injection detection in modern language models, including state-of-the-art architectures. The research aims to identify the vulnerabilities of models to input data distortion, in which dangerous requests (for example, instructions for creating malware, weapons, or offensive content) may go unnoticed due to changes in the structure or content of the text. This allows us to assess how much methods of masking attacks reduce the ability of models to recognize threats that violate moral and legal norms.

Methodology

The AdvBench dataset [18] is used for testing, chosen due to its specialization in dangerous instructions that violate moral and legal norms (for example, the generation of malware, weapons, or offensive content), which directly corresponds to the purpose of the study – the analysis of obfuscation methods aimed at bypassing defense mechanisms against prompt injections. Its advantages include a narrow focus on attacking scenarios that are not present in universal datasets (for example, GLUE), as well as standardization and reproducibility, as it is widely used in AI security research. The data covers a variety of attack types, from social engineering to technical manipulation, providing representativeness for testing obfuscation techniques.

Testing procedure

1) Choosing models

Three classifier models are selected for the experiment, each of which is designed to detect prompt injections. These models can be based on the BERT architecture or other similar encoder models.

2) Obfuscation of instructions

Each prompt from the AdvBench dataset is obfuscated using each of the selected methods in turn.

3) Classification

Obfuscated instructions are fed to the input of each classifier model. The output of the model, which is a classification of the text as SAFE or INJECTION, is saved for subsequent analysis.

4) Analysis of the results

a) For each obfuscation method, the number of classifier responses belonging to the SAFE and INJECTION classes is calculated.

b) The SAFE percentage among all requests is calculated for each obfuscation method.

The percentage of SAFE requests is calculated using the formula 1:

The proportion of safe = SAFE/(SAFE + INJECTION) (1)

c) The effectiveness of obfuscation methods is evaluated.

The effectiveness was assessed by the deviation of the proportion of SAFE classifications for obfuscated promptes relative to the base value - the proportion of SAFE classifications for simple instructions. Formula 2 was used for the calculation:

Efficiency = Proportion of safe(obfuscation) - Proportion of safe (simple instruction) (2)

d) The results are compared between different obfuscation methods, as well as with similar indicators for instructions that do not undergo obfuscation.

Expected results

The experiment aims to identify differences in the effectiveness of classifier models in processing obfuscated and non-obfuscated instructions. It is expected that obfuscation by applying certain methods will complicate the classification task by increasing the proportion of SAFE classifications for dangerous queries compared to other methods. Comparing the results between different obfuscation methods will allow us to evaluate the effectiveness of these methods in hiding prompt injections. We strive to determine which obfuscation methods are most successful in hiding dangerous requests, that is, they lead to the fact that promptings are less likely to be classified as injections. This will help identify weaknesses in current approaches to prompt injection detection and suggest ways to improve them.

Tested models

The protectai/deberta-v3, XiweiZ/sts-ft-promptInjection and fmops/distilbert-prompt-injection models were chosen because they are specially trained to recognize dangerous requests (for example, attempts to deceive AI). Each of them is suitable for different situations: one works very accurately (DeBERTa), the other works quickly and saves resources (DistilBERT), and the third adapts flexibly to new data (SetFit). All of them are open and are often used in research and in AI systems protection systems, which allows you to compare the results and verify their reliability. This makes them convenient and representative for testing methods of industrial obfuscation.

1. protectai/deberta-v3-base-prompt-injection-v2

The protectai/deberta-v3-base-prompt-injection-v2 model is an improved version of the microsoft/deberta-v3-base model, which was further trained on a variety of combined datasets containing both prompt injections and conventional prompta. The main purpose of the model is to identify prompt injections by classifying the input data into two categories: 0 - no injection and 1 – injection detected.

2. XiweiZ/sts-ft-promptInjection

The XiweiZ/sts-ft-promptInjection model is a SetFit model [19] designed for text classification. She was trained using an effective teaching technique with a small number of examples, which includes further training of the sentence transformer using contrastive learning. After that, the classification head is trained using the features obtained from the pre-trained sentence transformer.

3. fmops/distilbert-prompt-injection

The fmops/distilbert-prompt-injection model is based on the DistilBERT architecture [20], which is a simplified and faster version of the original BERT model. DistilBERT retains the main characteristics of BERT, but at the same time has fewer parameters, which makes it more efficient in terms of computing resources. The fmops/distilbert-prompt-injection model is specially adapted for tasks related to prompt injection detection in texts.

Test results

fmops/distilbert-prompt-injection

The results of fmops/distilbert-prompt-injection testing by various obfuscation methods are presented in Table 1. Sentence obfuscation methods have shown higher efficiency compared to word obfuscation methods. For example, the "word mixing" and "reverse word order" methods demonstrated the highest efficiency indicators – 0.17 and 0.20, respectively (Table 2). This indicates their ability to effectively bypass the protective mechanisms of the model.

The methods of word obfuscation, such as "deleting random characters", "repeating letters", and "inserting random characters", showed negative effectiveness with an index of -0.08 (Table 2). This indicates that chaotic changes at the word level are not only unable to make it difficult for the model to detect prompt injections, but also make injections more noticeable.

Some methods, for example, "transliteration" and "adding diacritics", demonstrated the inability to change the operation of the model, as evidenced by an efficiency indicator of 0.00 (Table 2).

The experiment showed that obfuscation methods that change the structure of sentences are more effective in bypassing the fmops/distilbert-prompt-injection model. This may be due to the fact that changes at the sentence level create more significant distortions directly in the semantic structure of the text, while not creating changes in the sequence of tokens characteristic of attacking attacks, which makes it difficult for the model to analyze its danger.

Methods that introduce chaotic changes at the word level, such as inserting random characters or repeating letters, are less successful because they often correspond to patterns indicating an attempt to circumvent the protective mechanisms of the model. This highlights the importance of the structure and coherence of the text to prevent the detection of prompt injections in it.

Table 1 – Results of fmops/distilbert-prompt-injection testing by various obfuscation methods

Method

Contains a prompt injection

Safe

The share of safe

shuffle_words

389

131

0.25

reverse_sentence

374

146

0.28

add_random_text

504

16

0.03

transliterate

479

41

0.08

randomly_remove_characters

499

21

0.04

repeat_letters

518

2

0.00

insert_random_symbols

520

0

0.00

shuffle_characters

513

7

0.01

reverse_word

512

8

0.02

replace_with_similar_chars

512

8

0.02

insert_random_spaces

520

0

0.00

insert_invisible_characters

479

41

0.08

add_emojis

510

10

0.02

insert_foreign_characters

520

0

0.00

add_diacritics

479

41

0.08

use_mirrored_characters

516

4

0.01

simple_instruction

479

41

0.08

According to formula (2), the effectiveness indicators of each obfuscation method are calculated, which are ranked in descending order.

Table 2 – Effectiveness of various obfuscation methods in fmops/distilbert-prompt-injection testing

Method

Performance indicator

reverse_sentence

0.20

shuffle_words

0.17

transliterate

0.00

insert_invisible_characters

0.00

add_diacritics

0.00

add_random_text

-0.05

reverse_word

-0.06

replace_with_similar_chars

-0.06

add_emojis

-0.06

insert_foreign_characters

-0.06

shuffle_characters

-0.07

use_mirrored_characters

-0.07

repeat_letters

-0.08

insert_random_symbols

-0.08

insert_random_spaces

-0.08

randomly_remove_characters

-0.4

The XiweiZ/sts-ft-promptInjection model

The results of XiweiZ/sts-ft-promptInjection testing by various obfuscation methods are presented in Table 3. During the analysis of the effectiveness of the XiweiZ / sts-ft-promptInjection model, it was found that the methods, compared with the results of the previously tested offer obfuscation model, demonstrate moderate effectiveness in circumventing protective mechanisms – the effectiveness range for such methods ranged from 0.01 to 0.06 (Table 4). The method of "inserting invisible characters" proved to be the most effective among them, achieving a success rate of 0.06 (Table 4).

The methods of word obfuscation, including "deleting random characters", "inserting random characters" and "shuffling characters", also showed some of the highest efficiency indicators with a success rate of 0.05 (Table 4). However, it is important to keep in mind that overly complex methods that change words beyond recognition are more likely to be detected. This is due to the fact that models such as XiweiZ / sts-ft-promptInjection can be trained on data containing such anomalies and take into account the perplexity [21] of the prompta text when analyzing.

In addition, the methods "transliteration", "reverse word order", "adding emojis", "reverse sentence order" and "adding random text" demonstrated the lowest success rate in the range from 0.01 to 0.02 (Table 4).

Thus, in order to effectively bypass the model, it is necessary to find a balance between the complexity of the obfuscation method and the probability of its detection. This underlines the importance of a careful approach to the choice of obfuscation methods, taking into account the characteristics of the analyzed model and its ability to recognize anomalies in the text.

Table 3 – Results of XiweiZ/sts-ft-promptInjection testing by various obfuscation methods

Method

Contains a prompt injection

Safe

The share of safe

transliterate

266

254

0.49

randomly_remove_characters

252

268

0.52

repeat_letters

256

264

0.51

insert_random_symbols

251

269

0.52

shuffle_characters

248

272

0.52

reverse_word

268

252

0.48

replace_with_similar_chars

249

271

0.52

insert_random_spaces

253

267

0.51

insert_invisible_characters

246

274

0.53

add_emojis

265

255

0.49

insert_foreign_characters

260

260

0.50

add_diacritics

249

271

0.52

use_mirrored_characters

258

262

0.50

shuffle_words

255

265

0.51

reverse_sentence

267

253

0.49

add_random_text

263

257

0.49

simple_instruction

278

242

0.47

According to formula (2), the effectiveness indicators of each obfuscation method are calculated, which are ranked in descending order.

Table 4 – Effectiveness of various obfuscation methods in XiweiZ/sts-ft-promptInjection testing

Method

Performance indicator

insert_invisible_characters

0.06

randomly_remove_characters

0.05

insert_random_symbols

0.05

shuffle_characters

0.05

replace_with_similar_chars

0.05

add_diacritics

0.05

repeat_letters

0.04

insert_random_spaces

0.04

shuffle_words

0.04

insert_foreign_characters

0.03

use_mirrored_characters

0.03

transliterate

0.02

add_emojis

0.02

reverse_sentence

0.02

add_random_text

0.02

reverse_word

0.01

protectai/deberta-v3-base-prompt-injection model

During the analysis of the effectiveness of the protectai/deberta-v3-base-prompt-injection model for obfuscated promptings, it was found that sentence obfuscation methods demonstrate the highest efficiency among other methods, however, it does not exceed 0.00, which indicates that there is no effect on the operation of the model. The "word mixing" and "reverse word order" methods showed effectiveness of 0.00 and -0.01, respectively (Table 6).

Word obfuscation methods such as "transliteration" and "emoji addition" also demonstrated zero effect on the model's responses (Table 6). The rest of the methods have negative performance indicators, many of which are close to -1.

It is noteworthy that the success rate for simple instructions without changes was 0.99 (Table 5). This means that the model demonstrates a high probability of mistakenly classifying potentially dangerous promptings as safe or is not able to adequately recognize attempts to use large language models (LLM) for malicious purposes in the form of prompt injections.

In case of poor detection efficiency of prompt injections in malicious instructions, it should be borne in mind that the model can be trained to detect modified prompta, which can be used to bypass protective mechanisms and align models. In this regard, it is more important to take into account the comparison of the effectiveness of techniques among themselves, excluding comparisons with direct instructions.

Table 5 – Results of protectai/deberta-v3-base-prompt-injection testing by various obfuscation methods

Method

Contains a prompt injection

Safe

The share of safe

transliterate

3

511

0.99

randomly_remove_characters

268

246

0.48

repeat_letters

343

171

0.33

insert_random_symbols

248

266

0.52

shuffle_characters

512

2

0.00

reverse_word

510

4

0.01

replace_with_similar_chars

510

4

0.01

insert_random_spaces

501

13

0.03

insert_invisible_characters

514

0

0.00

add_emojis

5

509

0.99

insert_foreign_characters

506

8

0.02

add_diacritics

489

24

0.05

use_mirrored_characters

45

468

0.91

shuffle_words

3

517

0.99

reverse_sentence

8

512

0.98

add_random_text

42

478

0.92

simple_instruction

3

517

0.99

According to formula (2), the effectiveness indicators of each obfuscation method are calculated, which are ranked in descending order.

Table 6 – Effectiveness of various obfuscation methods in protectai/deberta-v3-base-prompt-injection testing

Method

Performance indicator

transliterate

0.00

add_emojis

0.00

shuffle_words

0.00

simple_instruction

00

reverse_sentence

-0.01

add_random_text

-0.07

use_mirrored_characters

-0.08

insert_random_symbols

-0.47

randomly_remove_characters

-0.51

repeat_letters

-0.66

add_diacritics

-0.94

insert_random_spaces

-0.96

insert_foreign_characters

-0.97

reverse_word

-0.98

replace_with_similar_chars

-0.98

shuffle_characters

-0.99

insert_invisible_characters

-0.99

Conclusion

In the course of the study, the influence of various obfuscation methods on the effectiveness of classifier models designed to detect prompt injections was studied. The analysis showed that structural changes in the text, especially at the sentence level, can significantly reduce the ability of models to recognize dangerous queries. This highlights the urgency of further developing mechanisms to protect language models from attacks based on input distortion.

The study was conducted using three modern models: protectai/deberta-v3, XiweiZ/sts-ft-promptInjection and fmops/distilbert-prompt-injection. For each obfuscation method, the proportion of SAFE classifications was calculated - that is, cases when the model mistakenly identified a malicious request as safe. The effectiveness of the method was estimated as the difference between the SAFE percentage for an obfuscated query and the base value obtained for unchanged instructions. This approach made it possible to objectively compare the results and identify the most vulnerable points of existing classifiers.

It has been found that methods that change the structure of a sentence, such as shuffling words or reverse word order, have proved to be the most successful in bypassing prompt injection detectors. At the same time, chaotic distortions at the word level often caused increased attention from the models and did not lead to a significant increase in the number of false positives.

The data obtained provide a basis for improving protection systems that are resistant to complex forms of obfuscation. The need for more rigorous testing and verification of models used to filter potentially dangerous content was also emphasized.

The results demonstrate the importance of finding a balance between the complexity of the obfuscation method and its ability to mask an attack. They also open up prospects for further research aimed at increasing the resilience of large language models to manipulation and creating more reliable mechanisms for their protection.

Prospects for further research

The prospects for further research include expanding the experimental base by analyzing the success of jailbreaking large language models (LLM) using various obfuscation methods. This will not only deepen our understanding of the vulnerabilities of existing defense mechanisms, but also identify key parameters that affect the effectiveness of attacks.

One of the key tasks will be to find a balance between two opposite effects: a decrease in the proportion of requests blocked by classifier models, and a simultaneous increase in the proportion of requests leading to a successful LLM jailbreak. To do this, it is planned to conduct a series of experiments where the parameters of obfuscation will be systematically varied, such as the complexity of transformations, their number and type (lexical, semantic or syntactic changes). Based on the data obtained, it is planned to identify the most effective obfuscation methods that minimize the likelihood of an attack being detected by defense mechanisms, but maximize the likelihood of compromising the target model.

References
1. Liu, Y., et al. (2024). Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24) (pp. 1831-1847).
2. Greshake, K., et al. (2023). Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (pp. 79-90).
3. Shi, J., et al. (2024). Optimization-based prompt injection attack to LLM-as-a-judge. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (pp. 660-674).
4. Sang, X., Gu, M., & Chi, H. (2024). Evaluating prompt injection safety in large language models using the PromptBench dataset.
5. Xu, Z., et al. (2024). LLM jailbreak attack versus defense techniques: A comprehensive study. arXiv e-prints. arXiv:2402.13457.
6. Hu, K., et al. (2024). Efficient LLM jailbreak via adaptive dense-to-sparse constrained optimization. In Advances in Neural Information Processing Systems (Vol. 37, pp. 23224-23245).
7. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (Vol. 36, pp. 80079-80110).
8. Li, J., et al. (2024). Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment. In Advances in Neural Information Processing Systems (Vol. 37, pp. 124292-124318).
9. Kwon, H., & Pak, W. (2024). Text-based prompt injection attack using mathematical functions in modern large language models. Electronics, 13(24), 5008.
10. Steindl, S., et al. (2024). Linguistic obfuscation attacks and large language model uncertainty. In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024) (pp. 35-40).
11. Kim, M., et al. (2024). Protection of LLM environment using prompt security. In 2024 15th International Conference on Information and Communication Technology Convergence (ICTC) (pp. 1715-1719). IEEE.
12. Wei, Z., Liu, Y., & Erichson, N. B. (2024). Emoji attack: A method for misleading judge LLMs in safety risk detection. arXiv preprint arXiv:2411.01077.
13. Rahman, M. A., et al. (2024). Applying pre-trained multilingual BERT in embeddings for improved malicious prompt injection attacks detection. In 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings) (pp. 1-7). IEEE.
14. Chen, Q., Yamaguchi, S., & Yamamoto, Y. (2025). LLM abuse prevention tool using GCG jailbreak attack detection and DistilBERT-based ethics judgment. Information, 16(3), 204.
15. Aftan, S., & Shah, H. (2023). A survey on BERT and its applications. In 2023 20th Learning and Technology Conference (L&T) (pp. 161-166). IEEE.
16. Chan, C. F., Yip, D. W., & Esmradi, A. (2023). Detection and defense against prominent attacks on preconditioned LLM-integrated virtual assistants. In 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE) (pp. 1-5). IEEE.
17. Biarese, D. (2022). AdvBench: A framework to evaluate adversarial attacks against fraud detection systems.
18. Liu, W., et al. (2025). DrBioRight 2.0: An LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis. Nature Communications, 16(1), 2256. https://doi.org/10.1038/s41467-025-57430-4
19. Pannerselvam, K., et al. (2024). SetFit: A robust approach for offensive content detection in Tamil-English code-mixed conversations using sentence transfer fine-tuning. In Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (pp. 35-42).
20. Akpatsa, S. K., et al. (2022). Online news sentiment classification using DistilBERT. Journal of Quantum Computing, 4(1).
21. Grytsay, H. M., Khabutdinov, I. A., & Grabovoy, A. V. (2024). Stackmore LLMs: Effective detection of machine-generated texts using perplexity value approximation. Reports of the Russian Academy of Sciences. Mathematics, Computer Science, Control Processes, 520(2), 228-237. https://doi.org/10.31857/S2686954324700590
22. Pape, D., et al. (2024). Prompt obfuscation for large language models. arXiv preprint arXiv:2409.11026.
23. Evglevskaya, N. V., & Kazantsev, A. A. (2024). Ensuring the security of complex systems integrating large language models: Threat analysis and defense methods. Economics and Quality of Communication Systems, 4(34), 129-144.
24. Shang, S., et al. (2024). Intentobfuscator: A jailbreaking method via confusing LLM with prompts. In European Symposium on Research in Computer Security (pp. 146-165). Springer Nature Switzerland.

First Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
The list of publisher reviewers can be found here.

The presented article on the topic "Analysis of the effect of industrial obfuscation on the effectiveness of language models in detecting industrial injections" corresponds to the topic of the journal "Software Systems and Computational Methods" and is devoted to countering the threats of industrial injections, which are a technique for manipulating input data in order to change the behavior of the model. As a goal, the authors indicate to investigate and analyze various methods of industrial obfuscation that can be used to bypass models of detection of industrial injections. As part of the research, the authors consider such techniques as replacing words with synonyms, mixing letters in words, introducing special characters, and other approaches. The authors independently evaluated the effectiveness of these methods using the example of modern language models trained for the task of text classification, and assessed their potential implications for data security and protection. The list of references is provided by Russian and foreign sources on the research topic. The style and language of the presentation of the material is quite accessible to a wide range of readers. The practical significance of the article is justified. The volume of the article corresponds to the recommended volume of 12,000 characters or more. The article is quite structured - there is an introduction, conclusion, internal division of the main part (prompt injections, jailbreaks and obfuscation methods, testing procedure, experimental purpose, expected results, tested models, research results, etc.). The authors use AdvBench dataset for testing, which contains 500 dangerous instructions that violate generally accepted moral and ethical standards. legal norms. The results of the study are presented graphically (in the form of tables) In the course of the study, the authors revealed that the use of more complex obfuscation methods, characterized by multiple transformations in a single word or sentence and affecting the semantic integrity of the source text, leads to an increase in the proportion of queries classified as injections. The authors emphasize the need for a thorough approach to testing the security of large language models, especially those equipped with additional defense mechanisms such as classifier models for detecting prompt injections, as well as the importance of finding the optimal balance between the complexity of the obfuscation method and its effectiveness in the context of attacks on language models. The disadvantages include the following points: there is no scientific novelty in the content of the article. There is no clear identification of the subject, the object of research. It is recommended to clearly identify the scientific novelty of the research, formulate the subject and object. It would also be advisable to add about the prospects for further research. The article "Analysis of the effect of prompt obfuscation on the effectiveness of language models in detecting prompt injections" requires further development based on the above comments. After making amendments, it is recommended for reconsideration by the editorial board of the peer-reviewed scientific journal.

Second Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
The list of publisher reviewers can be found here.

The reviewed article is devoted to the problem of countering the threats of attacks aimed at large language models and violating the algorithms embedded in them by developers. The article formulates the goals, object and subject of the study, provides the types of attacks and their characteristic features. It is difficult to assess the novelty of the work, the authors use well-known methods, test models presented in the public domain, and quantify the results by estimating the proportion of safe cases. The initial data is publicly available and is intended to be influenced by the methods listed by the authors, which makes it difficult to assess one's own contribution. Quantifying the results by a single criterion is inconclusive. The style of presentation is typical for the colloquial presentation of materials, but the terms used make it difficult to understand. The formulations used, the overload of the article with slang terms, make it difficult to assess the contribution of the authors and analyze the results of their own research (testing) using publicly available tools. There is an illustration and tables. The bibliography contains 21 sources, mainly foreign publications in conference collections, and only one position in a domestic journal. The links in the text are grouped. Remarks. It is necessary to change the title of the article and the keywords, bringing them in accordance with the terminology and style of scientific publication accepted in the Journal. There is no review of similar studies. This section of the peer-reviewed scientific article has been replaced by an information block containing a list of subspecies of attacks, as well as a list of names of methods with their indication in two languages. It is necessary to avoid using transliteration of English-language terms (especially in the title) or a large number of terms in a foreign language. Terms in the field of computational methods that are understandable to most readers are preferred. The description of the experiment is given formally, there is no justification for the choice of the initial data set and the tested models. All analyzed models are publicly available and designed to solve problems similar to those set by the authors. What is the contribution of the Authors? The method of identifying models in the text of the article, combined with the form of representation in the tables of methods, does not allow us to evaluate the personal contribution of the authors to the result, its novelty and originality. In the section "test results", the authors mention that the success rate of 0.25 is high, the success rate of 0 indicates low efficiency, and the proportion of 0.08 is higher. It is not clear which threshold value allows you to talk about low or high efficiency? The analysis does not reveal what this or that calculated efficiency value may be related to, which factor or the computing device used is decisive for the result? All tables must be aligned with the above list of methods, use a single language, and specify the units of measurement. The text should explain how quantitative estimates were calculated, how safe ones were identified, and how the proportion of safe ones was estimated. It is not clear on the basis of which criteria the mentioned performance assessment is carried out. It is necessary to check the correctness of the output data in the bibliography. The article will be of interest to readers whose research interests lie in the field of evaluating attack methods. The article can be published after correcting the wording and making edits.

Third Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
The list of publisher reviewers can be found here.

Journal: Software Systems and Computational Methods Topic: Analysis of the impact of input data obfuscation on the effectiveness of language models in detecting hint injection, the relevance of the topic is confirmed by a sufficient number of references to scientific publications from 24 sources. Recently, with the development of the Internet, information security has become very important. Special injection detection models are being developed to counter threats. However, attackers are also developing new methods of obfuscation to circumvent such defense mechanisms. Obfuscation means changing the structure and content of a text in order to make it more difficult to analyze and protect it. The objects of research are language models used in various web applications of natural language processing. The subject of the research is the methods of industrial obfuscation and their consequences on information security. The article is written in a competent technical language that is understandable to experts in this subject area. The aim of this work is to analyze known obfuscation methods and their effect on bypassing the detection models of prompt injections. Various approaches have been chosen as research methods, such as paraphrasing, mixing letters in words, introducing special non-printable characters, etc. The article is structured, contains a brief review of the literature, a description of the models and methods used. The absolute advantage of the research is to conduct a computational experiment. Three well-known models have been selected for testing, as they are specially trained to recognize dangerous queries. Its methodology is described, and important comparative results are obtained for different obfuscation models. The experiment aims to identify differences in the effectiveness of different models in processing obfuscated and non-obfuscated instructions. This will help identify weaknesses in information security and reduce the risk of external intrusion. The conclusion is written correctly, and the prospects for further research are indicated. Practical recommendations on the application of the results of the performed computational experiment are given. The prospects for further research include the expansion of the experimental base due to a larger number of language models. Based on the new data obtained, it is planned to identify the most effective methods of obfuscation that minimize the likelihood of an external information threat using new protective mechanisms. No critical comments were found. Recommendations: 1. Include the keyword "information threat" in the topic, depending on the specialty in which the thesis is planned to be defended. 2. The article uses a number of jargons that degrade the readability of the text, for example, "jailbreaks", "exploits". 3. The abbreviations LLM, BERT are not deciphered, they are not generally accepted. 4. Figure 1 is of poor quality in terms of text size. 5. Typo in the "Expected results" section. Conclusion: The article is recommended for publication in the journal Software Systems and Computational Methods and will be of interest to a wide readership.