
Taxes and Taxation
Reference:

Forecasting the tax burden of agricultural enterprises by machine learning methods

Kharitonova Anna Evgen'evna

ORCID: 0000-0001-8480-6279

PhD in Economics

Associate Professor, Department of Taxes and Tax Administration, Financial University

127083, Russia, Moscow, Verkhnyaya Maslovka str., 15

kharitonova.ae@yandex.ru

DOI:

10.7256/2454-065X.2023.4.43917

EDN:

VUBDLU

Received:

28-08-2023


Published:

05-09-2023


Abstract: The article analyzes data on a set of agricultural enterprises and builds machine learning models to predict the tax burden. The subject of the study is a system of statistical indicators of agricultural enterprises that characterize the level of tax burden. The purpose of the study is to predict the tax burden using machine learning methods. The introduction of modern artificial intelligence tools is an integral and inevitable process in all spheres, including taxation. Models forecasting the tax burden as a function of a set of factors were built with the following methods: regression analysis, decision tree, random forest and gradient boosting. High-quality tax burden forecasting models make it possible to assess the financial condition of enterprises more accurately, calculate and forecast profitability, and make informed investment and management decisions. Of the models built, gradient boosting showed the best quality. In general, the model predicts the tax burden better than traditional econometric models. The introduction of modern forecasting tools based on artificial intelligence methods will make it possible to obtain more accurate forecasts in less time, which will increase the efficiency of enterprises and the level of production.


Keywords:

tax burden, tax planning, tax forecasting, tax management, financial management, machine learning methods, decision tree, random forest, gradient boosting, regression models

This article is automatically translated.

Introduction

There is often a conflict of interest between tax authorities and taxpayers: the former aim to provide budgets at all levels with maximum tax revenues, while the latter aim to minimize tax liabilities in order to maximize their income [1]. That is why establishing an optimal tax burden for organizations supports the development of their activities and improves their tax discipline. At the same time, the tax burden is a significant indicator of the state of the economy. A high tax burden can have a negative impact on economic activity and investment, while a low tax burden can lead to underfunding of the state budget. The size of the tax burden depends on the specific tax system and on the policy and priorities of state regulation [2]. The task of any state is therefore to determine the optimal level of tax burden, one that balances the interests of the budget and of business.

Forecasting the tax burden is highly relevant today, especially given the unstable economic situation and frequent changes in tax legislation. Forecast values of the tax burden allow state and local government bodies to plan budget revenues and expenditures for future periods. This is important for developing effective strategies for financing social programs, defense, education and other public needs. The government and other interested parties need to analyze the current situation and make informed decisions on tax policy. Forecasts help to determine optimal tax rates, study the impact of changes in tax legislation and evaluate the effectiveness of tax incentives and benefits. In addition, by comparing the actual tax burden of an economic entity with the industry average, tax authorities identify potential cases of undeclared income or tax evasion. This contributes to a fair distribution of the tax burden and allows the state to collect the revenues it needs to perform its functions and meet its obligations. For entrepreneurs, assessing and forecasting the tax burden matters because it makes it possible to plan activities and calculate business indicators. Knowing the estimated tax burden makes it possible to assess the financial condition of the enterprise more accurately, calculate and forecast profitability, and make informed investment decisions [3].

In general, forecasting the tax burden is an important tool for the state, business and society as a whole. It makes it possible to identify potential problems, prevent abuse and optimize the state of the tax system to ensure sustainable economic development and social well-being.

The use of modern tools and software for forecasting the tax burden will improve the quality of the values obtained, reduce the time spent on data processing and building models and increase the efficiency of economic activity of enterprises.

Literature review

The current state of the economy requires better forecasting both at the macro level and at the level of an individual enterprise. Under post-pandemic and sanctions conditions, competent management requires the ability to predict risks accurately and promptly despite difficult circumstances and uncertainty in economic activity [9].

In conditions of uncertainty and lack of information, many researchers rely on the expert method. Thus, Treshchevsky Yu.I., Kosobutskaya A.Yu. and Opoikova E.A. used an expert method to forecast the impact of economic anti-sanctions measures on the economy of a region [10]. In the work of T.I. Zueva, an expert method was used to predict the parameters of innovative development of enterprises [11]. However, expert methods are difficult to adapt when conditions change and data are updated, which is especially relevant for the tax environment and calls for new analysis tools.

Statistical methods are the traditional choice for building forecasting models. For example, Khramtsova T.G. and Khramtsova O.O. use statistical methods to predict the financial results of an enterprise [12].

One of the most popular forecasting methods is correlation and regression analysis. For example, Kostina Z.A. and Mashentseva G.A. used correlation and regression analysis to predict the tax revenues of the budget of the subject of the Russian Federation [13], Kuzina E.I. – for forecasting tax revenues of the Ryazan region [14].

Yablokov D.Yu. compares the effectiveness of forecasting with the ARIMA method and with a neural network and makes a convincing case for the latter [15]. Artificial intelligence methods are actively used to predict the bankruptcy and financial condition of enterprises [16, 17], and in the work of S.S. Ivankova a decision tree model is used to assess the risk of bankruptcy [18].

Artificial intelligence methods are also used for forecasting in the agricultural sector of the economy. The possibilities of using artificial intelligence and neural network technologies in a digital platform for the breakthrough development of the Russian agro-industrial complex are considered in the work of Ilyshov A.P. and Tolmachev O.M. [19]. Machine learning methods are also used to predict the level of equipment of agricultural enterprises [20, 21].

Nevertheless, as the literature review has shown, the application of machine learning methods to forecasting the tax burden has not been studied, and the results of such methods have not been compared with classical correlation and regression analysis. The practical part of this study is devoted to that comparison.

Materials and methods of research

The input data were the accounting statements of 20,000 agricultural enterprises in Russia for 2021, forming a table of 20,000 rows by 138 columns. The data contain missing values, qualitative variables and outliers. Preliminary processing began with the missing values. A number of indicators in forms 1 and 2 of the accounting statements are mandatory, so a gap in one of them means that the indicator equals 0. Among the remaining columns, those with more than 5% missing values were removed, and then all rows with at least one missing value were removed. As a result, 2,179 enterprises were excluded from the sample.
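The missing-value steps described above can be sketched in pandas. This is only an illustration under stated assumptions: the column names and the toy data are hypothetical, and the threshold is raised to 30% in the toy call because four rows cannot show a 5% cut-off.

```python
import numpy as np
import pandas as pd


def clean_missing(df, mandatory_cols, max_missing_share=0.05):
    """Apply the three missing-value steps to a copy of df."""
    out = df.copy()
    # Step 1: a gap in a mandatory reporting line is read as zero.
    out[mandatory_cols] = out[mandatory_cols].fillna(0)
    # Step 2: drop columns whose share of missing values exceeds the threshold.
    missing_share = out.isna().mean()
    out = out.loc[:, missing_share <= max_missing_share]
    # Step 3: drop every row that still has at least one missing value.
    return out.dropna(axis=0, how="any")


# Toy data with hypothetical column names (not the author's dataset).
toy = pd.DataFrame({
    "revenue": [100.0, np.nan, 250.0, 80.0],     # treated as a mandatory line
    "employees": [10.0, 5.0, np.nan, 8.0],       # 25% missing: kept at a 30% threshold
    "land_area": [np.nan, np.nan, np.nan, 4.0],  # 75% missing: column is dropped
    "year_founded": [1999.0, 2005.0, 2010.0, 2001.0],
})
cleaned = clean_missing(toy, mandatory_cols=["revenue"], max_missing_share=0.30)
print(cleaned)
```

The order of the steps matters: filling mandatory lines first keeps those columns from being counted as sparse in step 2.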

From a statistical point of view, it is advisable to compare the units of a population using relative indicators. Data characterizing economic activity should be related to the resources of the enterprise. For agricultural enterprises, the best denominator is the area of agricultural land. However, this indicator is missing from the source data, so the data were related to the average annual number of employees, and the other relative indicators of enterprise activity that the data allow were also calculated:

- Tax burden, %;

- Company age, years;

- Net assets per employee, RUB;

- Accounts payable per employee, RUB;

- Stock ratio, RUB per employee;

- Share of non-current assets in total assets;

- Turnover ratio of total assets;

- Coefficient of concentration of equity (autonomy);

- Share of cost as a percentage of revenue.
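Several of the indicators in this list can be derived directly from balance-sheet lines. The sketch below assumes textbook definitions (for example, autonomy as equity divided by total assets) and hypothetical column names; the article does not give the exact formulas, and the tax burden itself is left out because its calculation is not specified.

```python
import pandas as pd


def add_relative_indicators(df, reporting_year=2021):
    """Add per-employee and ratio indicators; all formulas are assumptions."""
    out = df.copy()
    out["net_assets_per_employee"] = out["net_assets"] / out["employees"]
    out["payables_per_employee"] = out["accounts_payable"] / out["employees"]
    out["noncurrent_share"] = out["noncurrent_assets"] / out["total_assets"]
    out["asset_turnover"] = out["revenue"] / out["total_assets"]
    out["autonomy"] = out["equity"] / out["total_assets"]
    out["cost_share_pct"] = 100 * out["cost_of_sales"] / out["revenue"]
    out["company_age"] = reporting_year - out["year_founded"]
    return out


# One hypothetical enterprise, amounts in RUB.
toy = pd.DataFrame({
    "net_assets": [500.0], "accounts_payable": [200.0], "employees": [10.0],
    "noncurrent_assets": [600.0], "total_assets": [1000.0], "revenue": [2000.0],
    "equity": [400.0], "cost_of_sales": [1500.0], "year_founded": [2001],
})
indicators = add_relative_indicators(toy)
print(indicators.iloc[0])
```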

Before identifying the factors of the tax burden, the data were checked for outliers, which made the population more homogeneous. No outliers were detected for the tax burden itself (Figure 1), but they were present in a number of other indicators. Using the three-sigma rule, all enterprises with values outside the interval of the mean plus or minus three standard deviations were removed.


Figure 1 – Outlier diagnostics plots for the source data

After the missing values and outliers were removed, 15,015 enterprises remained.
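The three-sigma rule mentioned above can be sketched as a row filter. The data here are synthetic and the column names hypothetical; the sketch assumes the rule is applied column by column and that a row is dropped if any listed indicator falls outside its interval, which the article does not spell out.

```python
import numpy as np
import pandas as pd


def three_sigma_filter(df, cols):
    """Keep rows whose values lie within mean +/- 3 std for every listed column."""
    mean = df[cols].mean()
    std = df[cols].std()
    within = ((df[cols] - mean).abs() <= 3 * std).all(axis=1)
    return df[within]


rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(0, 1, 1000),
    "y": rng.normal(10, 2, 1000),
})
toy.loc[0, "x"] = 50.0  # one injected extreme value
filtered = three_sigma_filter(toy, ["x", "y"])
print(len(toy), "->", len(filtered))
```

Note that extreme values inflate the standard deviation they are judged against, so a single pass may leave milder outliers in place.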

Machine learning models were used as research methods. The following algorithms were used to solve the regression problem:

· Decision trees;

· Random forest;

· Gradient boosting;

· Neural networks.

A decision tree is a binary recursive nonparametric procedure that can process quantitative and qualitative input and output variables in their original, raw form [4]. The goal is to create a model that predicts the value of a target variable by learning simple decision rules derived from the features of the data. Each internal node of the tree tests a condition on a certain variable, the branches are the outcomes of the test, and the leaf nodes hold the decision reached after all the tests along the path [5]. In predicting the tax burden, a decision tree can also serve to find the most significant factors so as to obtain the most accurate forecast.

Boosting is a procedure for building a sequence of algorithms in which each next one tries to compensate for the shortcomings of the previous ones. Gradient boosting creates a forecasting model in the form of an ensemble of weak forecasting models, usually decision trees. It builds the model in stages and generalizes earlier boosting schemes by allowing the optimization of an arbitrary differentiable loss function [6]. For forecasting the tax burden, gradient boosting provides high-quality forecasts based on the non-trivial partitions found by the decision tree algorithm.
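The stagewise idea can be illustrated with a hand-rolled loop for the squared-error loss, where the negative gradient is simply the current residual. This is a teaching sketch on synthetic data, not the implementation used in the study; `fit_boosting` and `predict_boosting` are hypothetical helper names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosting(X, y, n_stages=50, learning_rate=0.1, max_depth=2):
    """Stagewise boosting for squared error: each tree fits the current residuals."""
    base = float(np.mean(y))           # stage 0: constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_stages):
        residuals = y - pred           # negative gradient of 0.5 * (y - pred)**2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees


def predict_boosting(model, X):
    """Sum the constant and the scaled contributions of all trees."""
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

model = fit_boosting(X, y)
mse_baseline = float(np.mean((y - y.mean()) ** 2))
mse_boosted = float(np.mean((y - predict_boosting(model, X)) ** 2))
print(f"baseline MSE {mse_baseline:.3f}, boosted MSE {mse_boosted:.3f}")
```

For other differentiable losses, only the line computing `residuals` would change, since each tree is fitted to the negative gradient of the chosen loss.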

Random forest is an algorithm that combines several decision trees based on the idea of ensemble learning. To form each tree in the ensemble, a bagging procedure is used: elements of the training sample are drawn at random with replacement into a training subsample [7, 8]. Combining trees makes it possible to obtain more accurate and stable forecasts of the tax burden when the dependencies are nonlinear and non-trivial.
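A minimal sketch of the bagging step follows, assuming synthetic data. Each tree is trained on a bootstrap sample and the forecasts are averaged; the `max_features="sqrt"` setting imitates the random feature subsets that a random forest also uses at each split. The helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # indices may repeat
        tree = DecisionTreeRegressor(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_bagged(trees, X):
    """Average the forecasts of all trees in the ensemble."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 4))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.3, size=400)

trees = fit_bagged_trees(X[:300], y[:300])
forecast = predict_bagged(trees, X[300:])
print(forecast[:5])
```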

The Python programming language, with the Anaconda distribution and the Jupyter Lab environment, was used as the data processing tool. The following packages were used to load and analyze the data: numpy, pandas, seaborn, matplotlib, sklearn and tensorflow.

Research results

To determine the relationships between the features, a heat map of the correlation coefficients was built (Figure 2). No factor shows a pronounced linear relationship with the tax burden, which is also confirmed by a multiple linear regression model whose coefficient of determination is only 21%. The tax burden therefore cannot be predicted reliably with this model. To capture nonlinear relationships between the features and make forecasting possible, machine learning models were built.

Figure 2 – Heat map of correlation coefficients
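Why a weak linear correlation does not rule out a dependence can be shown on synthetic data: below, the target depends on one feature quadratically, so both the Pearson correlation and the linear R2 stay near zero. The data and column names are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
# The target depends on f1 only through its square, a purely nonlinear link.
toy["tax_burden"] = toy["f1"] ** 2 + 0.1 * rng.normal(size=n)

# Pearson correlation matrix; seaborn.heatmap(corr, annot=True) could draw it.
corr = toy.corr()
print(corr.round(2))

# Baseline multiple linear regression and its coefficient of determination.
X, y = toy[["f1", "f2"]], toy["tax_burden"]
r2_linear = LinearRegression().fit(X, y).score(X, y)
print(f"linear R2 = {r2_linear:.3f}")
```

Tree-based models can pick up such a relationship because they split the feature range into intervals instead of fitting one slope.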

The "Decision Tree" model builds a graph with nodes in which conditions are set and leaves that hold the possible solutions. To build a model with the best characteristics, the parameters were selected using the GridSearchCV function. Even so, the quality of the resulting model was not high enough. The coefficient of determination shows that only 27.4% of the variation in the tax burden is explained by the factors included in the model. The other quality metrics considered are the mean absolute error (MAE) and the mean absolute percentage error (MAPE). The MAE was 0.040 against an average value of 0.075, and the MAPE between the predicted and actual tax burden was 1.104 (Table 1). This model is not suitable for forecasting because of the low coefficient of determination and the high error values.
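Parameter selection with GridSearchCV might look like the following sketch. The parameter grid, the synthetic data and the 75/25 split are assumptions made for the example; the article does not report which parameters were searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 3))
y = np.abs(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, size=600)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: tree depth and minimum leaf size control overfitting.
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,           # 5-fold cross-validation on the training part only
    scoring="r2",
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
test_r2 = best_tree.score(X_test, y_test)  # R2 on data unseen during the search
print(search.best_params_, f"test R2 = {test_r2:.3f}")
```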

The "Random Forest" model is based on a committee of decision trees and usually shows higher quality. Parameters were again selected to obtain the best model. The resulting coefficient of determination was 43.4%, that is, 43.4% of the variation in the tax burden is explained by the factors included in the model. The MAE was 0.035 against an average value of 0.075, and the MAPE between the predicted and actual tax burden was 0.949. Overall, the quality of the model was higher than that of the "Decision Tree" algorithm.

For comparison, a model was built using a gradient boosting algorithm based on decision trees. After parameter selection, the coefficient of determination was 46.3%, that is, 46.3% of the variation in the tax burden is explained by the factors included in the model. The quality metrics of the constructed models are compared in Table 1. The coefficient of determination of the gradient boosting model is the highest, 2.9 percentage points above that of the Random Forest model. The higher quality of the model is also confirmed by the MAE (0.033) and the MAPE (0.847), which are lower than in the "Decision Tree" and "Random Forest" models. Accurate and reliable forecasts require a higher coefficient of determination, but the gradient boosting model already makes it possible to estimate likely values of the tax burden.

Table 1 – Evaluation of the quality of the constructed machine learning models

Regression model                                        R2       MAE      MAPE
"Decision Tree" (DecisionTreeRegressor)                 0.274    0.040    1.104
"Random Forest" (RandomForestRegressor)                 0.434    0.035    0.949
"Gradient Boosting" (HistGradientBoostingRegressor)     0.463    0.033    0.847

R2 is the coefficient of determination, MAE the mean absolute error, MAPE the mean absolute percentage error.

 

Using the best model, let us compare the predicted and actual values of the tax burden for three randomly selected enterprises from the sample. For the first enterprise, the tax burden is predicted to be 7.6%, which is 4.1 percentage points higher than the actual value, so the error is quite high. For the second enterprise, the predicted tax burden was 5.7% against an actual level of 3.9% (a difference of 1.8 percentage points). For the third enterprise, the forecast was 3.1% against an actual level of 6.7% (a difference of 3.6 percentage points).

Thus, machine learning methods give better results than traditional regression models, but the models should be improved and refined to obtain more reliable forecasts. One way to improve their quality is to divide the enterprises into more homogeneous groups according to the results of their economic activity and to build a separate model for each group. In general, machine learning methods are a promising direction for tax forecasting and for developing a company's tax strategy.

Conclusions and suggestions

In the course of the research, data on 20,000 agricultural enterprises were processed in the Python programming language, and models for forecasting the tax burden were built using multiple linear regression, a decision tree, a random forest and gradient boosting. A comparison of the models shows that gradient boosting and random forest are clearly superior in quality metrics to linear regression and a single decision tree with the same set of input predictors. Thus, the use of machine learning methods improves the quality of forecasting, and they can therefore be introduced into the activities of enterprises.

The capabilities of modern artificial intelligence tools make it possible, other things being equal, to obtain more reliable forecasts than traditional econometric models. At the same time, the use of programming languages suited to data analysis, such as Python or R, reduces the cost of preprocessing data and building models, so that forecasts can be obtained in time for operational decisions that increase the economic efficiency of activities.

It is important to note that the methodological approach presented in this study can be applied not only by economic entities in tax planning and forecasting, but also by tax authorities in determining criteria for identifying objects of close tax control. In particular, models built with machine learning methods yield a list of related indicators that tax authorities can consider together with a low level of tax burden when selecting organizations for on-site tax control.

References
1. Tikhonova, A.V. (2021). Tax burden and other motives for the law-abiding behavior of individuals. Economy. Taxes. Right, 14(2), 169-178.
2. Medyukha, E.V. & Artyushenko, E.V. (2019). The impact of the tax burden on the financial and economic activities of the enterprise. Vector of the economy, 10(40), 14.
3. Goncharenko, A.E., Zvereva, T.V., & Karpova, G.N. (Eds.). (2023). Taxes and the tax system of the Russian Federation. Textbook, Moscow.
4. Wu, X., Kumar, V., & Quinlan, R. (Eds.). (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14, 1–37.
5. Nasteski, V. (2007). An overview of the supervised machine learning methods. Horizons, 4, 51-62.
6. Shatrov, A.V. & Pashchenko, D.E. (2019). Comparison of classical regression models with models built using advanced machine learning methods. Advanced Science, 1(12), 24-28.
7. Biau, G. (2012). Analysis of a Random Forests Model. Journal of Machine Learning Research, 13, 1063-1095.
8. Kopoteva, A.V., Maksimov, A.A. & Sirotina, N.A. (2021). Machine learning models in the problem of forecasting the natural resource potential of the Perm region. Bulletin of the South Ural State University. Series: Computer technologies, control, radio electronics, 21(4), 126-136.
9. Polukhina, I.V. (2022). Analysis of risks and on-farm reserves of sustainable development of organizations in the context of unprecedented economic restrictions and new realities of competition. Modern Economics: Problems and Solutions, 5(149), 125-142.
10. Treshchevsky, Yu.I., Kosobutskaya, A.Yu. & Opoykova, E.A. (2022). Forecasting the impact of anti-sanction measures of economic policy on the regional economy. Modern economy: problems and solutions, 8(152), 8-25.
11. Zueva, T.I. (2020). Application of the method of expert assessments in predicting indicators of the innovative potential of an enterprise. Moscow Economic Journal, 6, 82.
12. Khramtsova, T.G. & Khramtsova, O.O. (2021). Forecasting financial results based on statistical methods. Transport business in Russia, 3, 12-15.
13. Kostina, Z.A. & Mashentseva, G.A. (2019). Forecasting tax revenues of the budget of the subject of the Russian Federation using correlation and regression analysis. Siberian Financial School, 5(136), 144-147.
14. Kuzina, E.I. (2021). Application of correlation-regression analysis in forecasting tax revenues in the Ryazan region. Bulletin of the Volga University V.N. Tatishchev, 2(3(48)), 133-142.
15. Yablokov, D.Yu. (2015). Statistical methods of tax forecasting in conditions of uncertainty of the external environment. Problems of economics and management in trade and industry, 2(10), 42-47.
16. Apatova, N.V. & Popov, V.B. (2020). Forecasting the bankruptcy of enterprises using artificial intelligence. Scientific Bulletin: finance, banks, investments, 2(51), 113-120.
17. Vinogradov, A.S. (2022). The use of machine learning in financial forecasting in banks. Topical issues of modern economics, 5, 705-710.
18. Ivankova, S.S. (2022). Evaluation and analysis of bankruptcy risk using a decision tree model of machine learning. Interactive science, 2(67), 44-46.
19. Ilyshev, A.P. & Tolmachev, O.M. (2019). Artificial intelligence and neural network technologies in the digital platform for the breakthrough development of the Russian agro-industrial complex. Economy and society: modern models of development, 9(4(26)), 492-507.
20. Khudyakova, E., Nikanorov, M., Bystrenina, I., Cherevatova, T. & Sycheva, I. (2021). Forecasting the production of gross output in the agricultural sector of the Ryazan oblast. Estudios de Economía Aplicada, 39(6). doi:10.25115/eea.v39i6.5171
21. Khudyakova, E.V., Nikanorov, M.S. & Butyrin, V.V. (2021). Problems of analysis and forecasting of the level of technical equipment of agricultural enterprises (on the example of the Ryazan region). Accounting in agriculture, 2, 69-77.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The reviewed article is devoted to the topical problem of forecasting the tax burden. Forecasting the tax burden of business entities is still an insufficiently developed scientific field, which is partly evidenced by the state and dynamics of tax arrears to the consolidated budget of the Russian Federation. The study of tax forecasting is also prompted by the need to solve the complex task of bringing Russia out of crisis conditions characterized by economic instability, economic sanctions and the special military operation. In addition, the forecasting system itself needs to be improved in order to increase the accuracy of forecasts and take into account the most significant economic factors, in particular investments.

The author's innovative approach consists in applying machine learning methods, which are hardly used in scientific research on tax topics; the few existing articles give only a theoretical description of the possibilities of this group of methods. In addition, enterprises traditionally use their own time-series data to predict the burden at the micro level, while the presented approach is based on the analysis of a set of data from business entities in agriculture. The main methods used by the author are decision trees, random forest, gradient boosting and neural networks. The results of each method can be compared with those of the others, so that the most accurate forecasting option can be chosen. Another advantage of the article is the fairly large sample of objects, which underwent a preliminary two-stage cleaning and verification (first for outliers, then for missing values). Working with a large amount of data (more than 15,000 organizations after processing) allowed the author to significantly improve the quality of the model and the reliability of the forecasts.

The scientific novelty consists in the development and testing of a methodological approach to forecasting the tax burden of an economic entity that is based on machine learning methods and has high statistical stability. The style of the article is scientific and corresponds to works of this level. The article is well structured and contains the following sections: Introduction, Literary review, Research materials and methods, Research results, Conclusions and suggestions. This corresponds to the IMRAD structure. The research has an internal logic and unity. The bibliography includes 20 items, including scientific articles on domestic and foreign studies. The author reviewed scientific opinions on the application of various forecasting methods. There is no engagement with opponents, but this is logical given that "as the literature review showed, the scope of application of machine learning methods in forecasting the tax burden has not been studied." Thus, the presented article has elements of scientific novelty as well as theoretical and practical significance. It is a non-standard transdisciplinary study that synthesizes financial and statistical sciences, and it will be of interest to the readership.