Propose and implement a Rule-Based System to Predict Crop Yield Production

Pramod Kumar Dwivedi1*, Dr. Prabhat Pandey2

1 Research Scholar, Awadhesh Pratap Singh University, Rewa, Madhya Pradesh, India

Email: ashwanikhajuraho@gmail.com

2 Professor, Department of Computer Science, Awadhesh Pratap Singh University, Rewa, Madhya Pradesh, India

Abstract - Accurate meteorological forecasting plays a pivotal role in agricultural decision-making, particularly in determining crop yield potential and optimizing agricultural practices. This study investigates the application of data mining techniques in meteorological forecasting to enhance crop yield prediction accuracy. Preliminary findings suggest that data mining techniques, including machine learning algorithms, neural networks, and ensemble methods, offer significant potential for improving the accuracy and reliability of meteorological forecasts for crop yield prediction. In conclusion, this study underscores the potential of data mining techniques in improving meteorological forecasting for crop yield potential assessment.

Keywords: Predict, Crop, Yield, Production and Climate

INTRODUCTION

There are many forces at work in agriculture, but one of the most important is climate change. Farmers are already feeling the effects of this global warming, and the weather is a major predictor of crop yields. Crop yields, animal populations, and ecological systems will all be more or less impacted by climate change, the precise nature of which varies around the globe. An increasing number of people are worried about the future of food production on a worldwide scale because of the predicted and actual consequences of climate change on farming and food safety. Increases in average global temperature, shifts in precipitation patterns, desertification, water body drying out, flooding, and deteriorating soil conditions all have an influence on agricultural production, crop yield, and the viability of agricultural products. Climate change also promotes the growth of pests and diseases.

Because agricultural production is a bio-socio-system that involves the interplay of soil, air, water, and crops, a complete model is required, which can be achieved only with classical engineering knowledge. The United Nations Food and Agriculture Organisation defines crop forecasting as the practice of estimating future crop output and yields many months before harvest. Meteorological, agro-meteorological, soil, remotely sensed, and agricultural statistics data are the cornerstones of crop forecasting. Several indices that are considered important factors in predicting crop production are derived from meteorological and agronomic data, including crop water satisfaction, surplus and excess moisture, and average soil moisture. The linear regression model describes the mathematical relationships inherent in a dataset derived from earlier trials. If the data used to build and test the model are comprehensive, this strategy can provide useful findings in a variety of contexts. However, agricultural statistics are often incomplete and sparse. This constraint is why the linear regression method is the de facto standard for yield prediction across expansive areas.

Multivariate regression models are the most studied crop-yield-weather models statistically. Data mining is a process that tries to discover interesting and useful patterns in data for farmers. The inability to accurately anticipate yields is a typical issue.

Both the Indian economy and Indian culture have long relied on agriculture. It engages the largest share of the country's population, whether through direct or indirect work, self-employment, or partial employment, and makes a substantial contribution to GDP. Modern society owes its very existence to agriculture. Agriculture, the practice of cultivating domesticated species for the purpose of producing food surpluses, was a critical innovation in the emergence of sedentary human civilization. The general definition of agriculture is the study of farming and other land-based pursuits. The practice of raising various living things for human subsistence and improvement is known as agriculture, farming, or husbandry. This includes not just animals but also plants and fungi. Plantations and crops are the principal products of cultivation. Animals are raised for meat, wool, and other goods. Aqua-products, such as fish, are also produced by agriculture.

LITERATURE REVIEW

Kamir (2020) uses machine learning with climatic records and NDVI time-series data to forecast wheat yield, capitalising on a large training dataset of high-resolution yield maps. The research shows that the most accurate method is support vector regression with radial basis functions. The results demonstrated that non-linear models outperformed linear ones, whereas ensembles failed to outperform individual models.

Nevavuori (2019) builds a crop yield prediction model using Convolutional Neural Networks (CNNs), a deep learning technique that has proven very effective in image classification applications. The model is trained on RGB and NDVI data collected from UAVs. The study evaluates how several CNN choices, such as network depth, training algorithm, hyperparameter tuning, and regularisation strategy, affect prediction performance. Using the Adadelta training algorithm with L2 regularisation, the MAE and MAPE were computed for the growing season. According to the findings, the CNN architecture performed better when fed RGB data than when fed NDVI data.

Sharma, Negativeya (2019) survey data mining approaches that have been used to estimate agricultural yields. The paper summarises the algorithms that previous authors have used for predicting agricultural yields. Because it consolidates the work of several authors in one place, researchers find it useful for understanding the present state of data mining methods and applications in the agricultural sector.

Chlingaryan (2018) reviews research developments over the past 15 years on machine learning techniques for crop yield prediction and nitrogen management. The review concludes that ML techniques and sensing technologies provide cost-effective solutions for better decision making and crop estimation.

Ravneet Kaur Sidhu et al. (2020) note that many people throughout the globe rely on rice as a staple crop. Because rice production consumes large amounts of water, effective water management is essential for the long-term viability of this resource, yet information on the water use of rice irrigation is severely lacking. In this work, the daily irrigation schedule of rice is predicted using standard machine learning approaches. Data from 2013 to 2015 are used to train and optimise the models, which are then tested on data from 2016-2017. Feature selection using correlation criteria reduces the number of input parameters from 26 to 11. Based on the predicted weather conditions, the models calculate the amount of water needed by the crops. AdaBoost performed consistently well compared with the other models, with an average accuracy of 71% in forecasting the irrigation schedule.

RESEARCH METHODOLOGY

The data used in this study come from 28 districts in Chhattisgarh and cover the period from 2000 to 2015. Total precipitation data for the same 28 districts over a period of 15 years were likewise gathered from the Meteorological Department. Data preprocessing, one step of data mining, makes raw data easier to work with; here its purpose is to reduce the dataset's dimensionality. The process of dimensionality reduction is described as follows. Let E be a dataset that contains n data vectors y_i, i ∈ {1, 2, ..., n}, each of dimension m, so that E is represented as a matrix with dimensions n×m. After that, the data are sent to Matlab's curve fitting tool for principal component analysis (PCA), to determine which independent and dependent variables fit together best. Used as a regression tool, the Support Vector Machine retains the key characteristic of the method (the maximum margin). With a few key differences, Support Vector Regression (SVR) shares many of the ideas that the Support Vector Machine (SVM) uses for classification.
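
The dimensionality-reduction step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the Matlab workflow used in the study: it centres a small n×m matrix E, builds the sample covariance matrix, and finds the first principal component by power iteration. The data values, the meaning of the columns, and all function names are hypothetical.

```python
import math

# Hypothetical n x m data matrix E: each row is one observation
# (columns might be rainfall, temperature, yield; the values are invented).
E = [
    [150.0, 31.2, 6.8],
    [96.0, 33.5, 5.9],
    [201.0, 29.8, 5.6],
    [104.0, 32.9, 6.4],
    [180.0, 30.1, 6.0],
    [72.0, 34.0, 4.7],
]

def center(rows):
    """Subtract the column mean from every entry."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (m x m) of already-centred data."""
    n, m = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
            for a in range(m)]

def first_component(cov, iterations=500):
    """Leading eigenvector of the covariance matrix by power iteration."""
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

centered = center(E)
cov = covariance(centered)
pc1 = first_component(cov)
# Project each observation onto the first component: m columns become 1.
scores = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in centered]
```

In this sketch the columns are not rescaled, so the column with the largest variance dominates the first component; standardising each column first is a common alternative when the variables have different units.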

RESULT

THE EFFECT OF RAINFALL ON CROP YIELD

Growing cotton uses around 3% of the world's irrigation water, although only 2.2% of the world's arable land is actually dedicated to this crop. So, a lot of irrigation water goes into growing cotton. Unless you're using a dry farming cotton production strategy, the water needed for crops comes from both natural precipitation during harvest and artificial irrigation in the months thereafter.

The Food and Agriculture Organisation (2012) defines water usage as the amount of water actually evaporated from the soil. Two distinct processes, evaporation and transpiration, work together to form evapotranspiration, the mechanism by which water is lost from the soil surface and consumed by the crop. It quantifies the overall water consumption throughout crop growth in the field, or the water loss from extraction to crop supply. Actual evapotranspiration (ETa) ranges from 410 to 780 mm per season, with the exact amount dependent on the irrigation technique and the degree of deficit irrigation (or of precipitation). For cotton to achieve satisfactory yields, enough water is needed to sustain 700 mm of evapotranspiration (the sum of soil evaporation and transpiration). The determinants of ETa are soil type and other soil properties, weather conditions (such as precipitation, relative humidity, and wind speed), and other environmental factors.

There are three main aspects of water management: the amount of water extracted for irrigation, how efficiently that water is used to grow cotton, and the quality of the water, both when it is applied to the crop and when it (possibly) leaves the farm through deep percolation or surface runoff. Figure 1 shows that the water need of cotton crops changes as they develop through their various phases. What really affects cotton crop yield is not the overall amount of precipitation, but rather the amount that falls throughout the three distinct phases of growth: germination, fruiting, and maturity.

Figure 1: Graph of optimum water needs of cotton during growth period

Water stress, which is caused by a lack of water, shows up as reduced photosynthetic activity and increased leaf senescence. Drought stress causes significant shedding of small squares, which reduces flowering. Although boll abscission occurs in response to water stress in the first fourteen days after anthesis (the opening of the flower bud, which begins the plant's flowering period), flowers and large squares or bolls are seldom shed. Thus, it is not uncommon for young plants to keep flowering even when subjected to extreme stress. Boll size and seed weight are both affected by water stress in the 20-30 days after anthesis.

Severe heat in the afternoon is a common cause of water stress. When exposed to temperatures that are too high or too low, cotton plants and the fibres they produce suffer.

The impact of a delayed or failed monsoon may be mitigated if farmers own irrigation infrastructure. Both the vegetative and fruiting stages of a cotton crop are stunted when water is unavailable in the root zone. Total output is affected by growth stage, moisture shortage level, and water stress period.

Data Classification

The current research examined the impact of rainfall on cotton output by treating rainfall as the independent variable. Cotton production figures were therefore compared with rainfall. Using secondary sources, we compiled ten-year rainfall averages for the talukas of all three districts. Based on the rainfall data from 2006 to 2015, the talukas in the three selected districts were categorised as having 'High', 'Medium', or 'Low' rainfall. Table 1 gives the basis for the categorisation.

Table 1 Classification of Talukas on the basis of average rainfall

Rain classification	Balrampur	Bastar	Durg
Low	up to 93	up to 82	up to 116
Medium	94 to 132	83 to 126	117 to 141
High	133 and above	127 and above	142 and above
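
The classification in Table 1 amounts to a simple threshold rule per district, which can be sketched as follows. This is an illustrative sketch only: the function and dictionary names are hypothetical, and the treatment of fractional averages that fall between two tabulated bounds (for example 93.5 for Balrampur) is an assumption, since the table lists whole numbers.

```python
# Inclusive upper bounds of the 'Low' and 'Medium' classes, taken from Table 1.
THRESHOLDS = {
    "Balrampur": (93, 132),
    "Bastar": (82, 126),
    "Durg": (116, 141),
}

def classify_rainfall(district, average_rainfall):
    """Return 'Low', 'Medium', or 'High' for a taluka's ten-year average rainfall.

    Assumption: a fractional value above one bound is placed in the next
    class up (93.5 in Balrampur counts as 'Medium').
    """
    low_max, medium_max = THRESHOLDS[district]
    if average_rainfall <= low_max:
        return "Low"
    if average_rainfall <= medium_max:
        return "Medium"
    return "High"

example_class = classify_rainfall("Durg", 120)
```

For instance, an average of 120 for a taluka in Durg lies between 117 and 141 and is therefore classified as 'Medium'.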

The Non-Parametric Test

Parametric tests, contrary to popular belief, may actually work really effectively with non-normal continuous data provided that the sample size requirements are satisfied. Although nonparametric tests do not presume normally distributed data, they do entail additional, more challenging assumptions. A typical assumption for non-parametric tests comparing groups is that all groups' data must have the same spread (dispersion). Nonparametric tests may not provide reliable findings if the distributions of your groups are different. The term "distribution-free statistics" applies to non-parametric statistics that do not make any assumptions on the population's distribution. As a result, data with a large range of variation may be readily accommodated. These distribution-free tests are applicable to both quantitative and qualitative data, in contrast to parametric statistics.

The Wilcoxon Signed Rank Test

To determine whether rainfall had an influence on cotton yield, we used the Wilcoxon signed-rank test, also known as the Wilcoxon matched-pairs test. The test is appropriate because the data consist of definite scores. It works well in a repeated-measures design, where the same subjects are measured under two distinct conditions; here, the yield and rainfall data sets were paired by taluka. The test is the nonparametric counterpart of the paired t-test. It takes into account the magnitude of the difference within each pair, so it uses more of the information in the two sets of scores than the basic sign test and is considered more powerful. For the test to be valid, the two sets of scores must be related and measured on at least an ordinal scale.

The first set of values is the ten-year average rainfall for the talukas of all three districts. It was paired with cotton yield data from several years. We compared the rainfall and yield figures pair by pair to see how large the differences were. Table 2 displays the findings.

Table 2 Wilcoxon Signed-Rank Test Result for rainfall

Wilcoxon Signed-Rank Test
Treatment 1 (rainfall)	Treatment 2 (yield)	Sign	Absolute difference	Rank	Signed rank
150.12	680	-1	529.88	23	-23
145.85	879	-1	733.15	30	-30
96.67	598	-1	501.33	19	-19
65.27	691	-1	625.73	29	-29
201.63	561	-1	359.37	8	-8
152.55	597	-1	444.45	14	-14
104.7	643	-1	538.3	25	-25
180.63	599	-1	418.37	11	-11
103.82	234	-1	130.18	1	-1
155.2	717	-1	561.8	26	-26
152.52	467	-1	314.48	6	-6
141.83	440	-1	298.17	4	-4
72.17	470	-1	397.83	10	-10
95.22	594	-1	498.78	18	-18
155.13	639	-1	483.87	17	-17
146.68	667	-1	520.32	21	-21
153.3	758	-1	604.7	28	-28
180.87	702	-1	521.13	22	-22
74.9	613	-1	538.1	24	-24
145.2	655	-1	509.8	20	-20
188.63	766	-1	577.37	27	-27
155.07	389	-1	233.93	3	-3
115.33	508	-1	392.67	9	-9
88.6	558	-1	469.4	16	-16
137.42	450	-1	312.58	5	-5
106.77	550	-1	443.23	13	-13
98.47	531	-1	432.53	12	-12
170.87	487	-1	316.13	7	-7
130.93	351	-1	220.07	2	-2
125.43	590	-1	464.57	15	-15

A Wilcoxon signed-rank test showed a significant difference between the paired rainfall and cotton yield values (n = 30, Z = -4.7821, p < 0.05). All 30 differences are negative, so the sum of the positive ranks is 0, and there is a marked difference in magnitude between the two sets of values.
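
The Z statistic can be reproduced from the values in Table 2. The sketch below is a minimal illustration, assuming the usual large-sample normal approximation without a continuity correction (the study does not state which variant was used); the variable names are hypothetical. Ranks are assigned to the absolute differences, and the smaller of the two signed-rank sums is standardised with mean n(n + 1)/4 and standard deviation sqrt(n(n + 1)(2n + 1)/24).

```python
import math

# Paired values from Table 2: Treatment 1 (rainfall) and Treatment 2 (yield).
rainfall = [150.12, 145.85, 96.67, 65.27, 201.63, 152.55, 104.7, 180.63,
            103.82, 155.2, 152.52, 141.83, 72.17, 95.22, 155.13, 146.68,
            153.3, 180.87, 74.9, 145.2, 188.63, 155.07, 115.33, 88.6,
            137.42, 106.77, 98.47, 170.87, 130.93, 125.43]
crop_yield = [680, 879, 598, 691, 561, 597, 643, 599, 234, 717, 467, 440,
              470, 594, 639, 667, 758, 702, 613, 655, 766, 389, 508, 558,
              450, 550, 531, 487, 351, 590]

n = len(rainfall)
diffs = [r - y for r, y in zip(rainfall, crop_yield)]

# Rank the absolute differences from smallest (rank 1) to largest (rank n).
# These data contain no zero differences and no ties, so no tie handling is needed.
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0] * n
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

w_plus = sum(ranks[i] for i in range(n) if diffs[i] > 0)
w_minus = sum(ranks[i] for i in range(n) if diffs[i] < 0)
w = min(w_plus, w_minus)

# Normal approximation to the distribution of W under the null hypothesis.
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w - mean_w) / sd_w
```

With every difference negative, `w_plus` is 0 and `w_minus` is 465, so `z` is (0 - 232.5) / 48.618, which agrees with the reported value of -4.7821.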

Pearson’s Correlation Coefficient

We also measured how strongly the two variables were related to each other, using Pearson's correlation coefficient. Pearson's correlation coefficient measures the strength of the linear association between two continuous variables. Because it is based on the concept of covariance, it is widely regarded as the best way to measure the relationship between variables of interest. It reveals both the direction of the relationship and its strength.

The P-value is the probability of obtaining a result at least as extreme as the present one under the null hypothesis that the correlation coefficient is 0. The correlation coefficient is deemed statistically significant if this probability is below the conventional 5% (P < 0.05). No rule defines exactly what magnitude of correlation counts as high, moderate, or weak. The coefficient always lies between -1 (perfect negative correlation) and +1 (perfect positive correlation). Values around 0 indicate a weak or nonexistent linear relationship. For this kind of data, we consider correlations above 0.4 to be relatively high, those between 0.2 and 0.4 moderate, and those below 0.2 weak. Table 3 displays the findings.

Table 3 Pearson Correlation Coefficient Result for rainfall

Pearson Correlation Coefficient
x value (rainfall)	y value (yield)	X - Mx	Y - My	(X - Mx)^2	(Y - My)^2	(X - Mx)(Y - My)
150.12	680	17.061	100.533	291.066	10106.951	1715.166
145.85	879	12.791	299.533	163.601	89720.218	3831.231
96.67	598	-36.389	18.533	1324.184	343.484	-674.416
65.27	691	-67.789	111.533	4595.394	12439.684	-7560.77
201.63	561	68.571	-18.467	4701.936	341.018	-1266.272
152.55	597	19.491	17.533	379.886	307.418	341.736
104.7	643	-28.359	63.533	804.252	4036.484	-1801.763
180.63	599	47.571	19.533	2262.968	381.551	929.214
103.82	234	-29.239	-345.467	854.939	119347.218	10101.215
155.2	717	22.141	137.533	490.209	18915.418	3045.08
152.52	467	19.461	-112.467	378.718	12648.751	-2188.676
141.83	440	8.771	-139.467	76.925	19450.951	-1223.216
72.17	470	-60.889	-109.467	3707.511	11982.951	6665.352
95.22	594	-37.839	14.533	1431.815	211.218	-549.932
155.13	639	22.071	59.533	487.114	3544.218	1313.94
146.68	667	13.621	87.533	185.523	7662.084	1192.262
153.3	758	20.241	178.533	409.685	31874.151	3613.634
180.87	702	47.811	122.533	2285.86	15014.418	5858.4
74.9	613	-58.159	33.533	3382.508	1124.484	-1950.276
145.2	655	12.141	75.533	147.396	5705.284	917.025
188.63	766	55.571	186.533	3088.099	34794.684	10365.782
155.07	389	22.011	-190.467	484.469	36277.551	-4192.298
115.33	508	-17.729	-71.467	314.329	5107.484	1267.056
88.6	558	-44.459	-21.467	1976.632	460.818	954.394
137.42	450	4.361	-129.467	19.015	16761.618	-564.561
106.77	550	-26.289	-29.467	691.129	868.284	774.659
98.47	531	-34.589	-48.467	1196.422	2349.018	1676.43
170.87	487	37.811	-92.467	1429.647	8550.084	-3496.226
130.93	351	-2.129	-228.467	4.534	52197.018	486.482
125.43	590	-7.629	10.533	58.207	110.951	-80.362
		Mx: 133.059	My: 579.467	Sum: 37623.972	Sum: 522635.467	Sum: 29500.289

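The variables were positively associated, with a Pearson correlation coefficient of r = 0.2104, a moderate correlation on the scale given above. For n = 30 pairs this corresponds to t = r * sqrt(n - 2) / sqrt(1 - r^2) ≈ 1.14 on 28 degrees of freedom, which is below the two-tailed 5% critical value of about 2.05, so the association is not statistically significant at the 5% level.

Figure 2 shows the scatter plot of the 30 data pairs, and Figure 3 shows the average rainfall over the ten-year period. The points are widely scattered, with only a weak upward linear trend.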
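
The coefficient in Table 3 can be recomputed directly from the 30 data pairs. The following is a minimal sketch with hypothetical variable and function names: it evaluates r from the sums of squares and cross-products, derives the t statistic on n - 2 degrees of freedom that is commonly used to test r against zero, and labels the strength using the 0.2 and 0.4 cut-offs given above. The t statistic can be compared with a tabulated critical value (about 2.048 for 28 degrees of freedom at the two-tailed 5% level).

```python
import math

# x (rainfall) and y (yield) values from Table 3.
rainfall = [150.12, 145.85, 96.67, 65.27, 201.63, 152.55, 104.7, 180.63,
            103.82, 155.2, 152.52, 141.83, 72.17, 95.22, 155.13, 146.68,
            153.3, 180.87, 74.9, 145.2, 188.63, 155.07, 115.33, 88.6,
            137.42, 106.77, 98.47, 170.87, 130.93, 125.43]
crop_yield = [680, 879, 598, 691, 561, 597, 643, 599, 234, 717, 467, 440,
              470, 594, 639, 667, 758, 702, 613, 655, 766, 389, 508, 558,
              450, 550, 531, 487, 351, 590]

n = len(rainfall)
mx = sum(rainfall) / n          # Mx in Table 3
my = sum(crop_yield) / n        # My in Table 3

sxx = sum((x - mx) ** 2 for x in rainfall)
syy = sum((y - my) ** 2 for y in crop_yield)
sxy = sum((x - mx) * (y - my) for x, y in zip(rainfall, crop_yield))

r = sxy / math.sqrt(sxx * syy)

# t statistic for testing r against zero, on n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def strength(coefficient):
    """Label a correlation using the cut-offs stated in the text."""
    magnitude = abs(coefficient)
    if magnitude > 0.4:
        return "high"
    if magnitude >= 0.2:
        return "moderate"
    return "weak"

label = strength(r)
```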

Figure 2 Scatter plot for rainfall and yield correlation

Figure 3 Average rainfall for ten years from 2006 to 2015

Weka Classification and Calculations

Data mining (machine learning) classification is a method for predicting which groups data examples will belong to. Finding a model for a class attribute based on the values of other characteristics and then accurately assigning test data to those classes is the task at hand. Model creation, or specifying a collection of established classes, is the first stage in classification. The second step is to use that model for prediction, or categorizing future or unknown things.

WEKA offers two types of learners: supervised and unsupervised. Its classifier families, such as lazy, tree-based, rule-based, and naive Bayes classifiers, all fall within these groups. Meta classifiers are also available to further improve the accuracy of classifiers by combining them in ensembles. Figure 4 shows the stages of the WEKA classification process.

Figure 4 WEKA Classification Steps (Ref. Shweta Srivastava (2014))

Using WEKA's cross-validation features and the target variable "Yield", several algorithms were applied to the rainfall data. The methods assessed were Multilayer Perceptron (MLP), Additive Regression (AR), KStar, Sequential Minimal Optimisation for regression (SMOreg), and Gaussian Processes (GP).

Gaussian Processes (GP) are a probabilistic approach to regression and classification in which a distribution over functions is specified by a mean function and a covariance function; the WEKA implementation does not tune hyperparameters. MLP is a feed-forward multi-layer neural network classifier that uses back propagation, a supervised learning method, to classify cases. Many applications rely on MLPs for tasks such as pattern recognition, estimation, prediction, and classification. The SMOreg method applies the Support Vector Machine (SVM) to regression. For classification, a support vector machine identifies a line (more generally, a hyperplane) that effectively divides the training data into classes. Additive Regression (AR) is a meta classifier that improves the performance of a regression base classifier. It's not just an algorithm; it's based on a Bayesian probability model. For the rainfall and yield data, Table 4 displays the computed correlation coefficients and Root Mean Square Errors (RMSE).

Table 4 WEKA algorithms result for rainfall

WEKA Algorithm	Correlation coefficient	RMSE	RMSE (cross-validation)
Gaussian Processes	0.2743	1.169307	1.318599
MLP	0.2881	1.735576	1.448596
SMOreg	0.2743	1.070578	1.366652
KStar	0.2752	1.52543	1.544968
Additive Regression	0.2507	1.778639	1.762545

The table shows that the correlation coefficients and root mean square errors (RMSEs) varied across the algorithms tested on the rainfall and cotton yield data. No single algorithm was best on every metric: MLP had the highest correlation coefficient (0.2881), SMOreg had the lowest RMSE (1.070578), and GP had the lowest cross-validation RMSE (1.318599), together with a correlation of 0.2743 and an RMSE of 1.169307. Judged by cross-validation error, GP performed best.
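
The two error columns in Table 4 differ in which data the model is scored on: the plain RMSE is measured on the data used for fitting, while the cross-validation RMSE is measured on points held out during fitting. The sketch below illustrates that distinction only. It is not WEKA and does not reproduce the table: it substitutes a simple least-squares line for the WEKA algorithms, the data are invented, and all function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_validated_predictions(xs, ys, k=5):
    """Predict every point from a model fitted on the other k - 1 folds."""
    n = len(xs)
    predictions = [0.0] * n
    for fold in range(k):
        # Point i belongs to fold (i mod k); train on all other points.
        train = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(n):
            if i % k == fold:
                predictions[i] = slope * xs[i] + intercept
    return predictions

# Hypothetical rainfall (x) and yield (y) values, invented for illustration.
rain = [65.0, 72.0, 88.0, 96.0, 104.0, 115.0, 125.0, 137.0, 150.0, 171.0]
yld = [4.1, 4.9, 4.6, 5.8, 5.2, 6.1, 5.7, 6.6, 6.3, 7.0]

slope, intercept = fit_line(rain, yld)
training_rmse = rmse(yld, [slope * x + intercept for x in rain])
cv_rmse = rmse(yld, cross_validated_predictions(rain, yld, k=5))
```

Because each held-out point is predicted by a model that never saw it, the cross-validation RMSE is usually the more honest estimate of how a model will perform on new data.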

CONCLUSIONS

This study presented a machine learning model for predicting crop yields that uses dimensionality reduction to narrow the collected data and to exclude information that may compromise the quality of the model. We employed principal component analysis (PCA) to preprocess the input dataset and then used the K-medoid clustering approach to obtain better predictions. The estimates obtained with WEKA make use of the linear regression method. When day-based forecasting is sought, satisfactory outcomes are not achieved, but when monthly averages are used the results are good. It is therefore clear that using a monthly average with the linear regression method for weather forecasting in WEKA produces good results.

REFERENCES

Kamir, E., Waldner, F., & Hochman, Z. (2020). Estimating wheat yields in Australia using climate records, satellite image time series and machine learning methods. ISPRS Journal of Photogrammetry and Remote Sensing, 160, 124–135.
Nevavuori, P. N. (2019). Crop yield prediction with deep convolutional neural networks. Computers and Electronics in Agriculture, 1-9.
Chlingaryan, A. S. (2018). Machine learning approaches for crop yield prediction and nitrogen status estimation in precision agriculture: A review. Computers and Electronics in Agriculture, 151, 61–69.
Sidhu, Ravneet & Kumar, Ravinder & Rana, Prashant. (2020). Machine learning based crop water demand forecasting using minimum climatological data. Multimedia Tools and Applications. 79. 13109-13124. 10.1007/s11042-019-08533-w.
D, Dr & Kamate, Nikhil & Gawade, Om & Matruprasad, Pinaki & Sai, Vamsi. (2024). Soil Analyser - Revolutionizing Agriculture through Wireless Sensor Networks and Machine Learning. International Journal for Research in Applied Science and Engineering Technology. 12. 85-88. 10.22214/ijraset.2024.58248.
Afrin, Sadia & Khan, Abu & Mahia, Mahrin & Ahsan, Rahbar & Mishal, Mahbubur & Ahmed, Wasit & Rahman, Mohammad. (2018). Analysis of Soil Properties and Climatic Data to Predict Crop Yields and Cluster Different Agricultural Regions of Bangladesh. 80-85. 10.1109/ICIS.2018.8466397.
Sakthipriya, S. & Naresh, R.. (2023). Precision agriculture: crop yields classification techniques in thermo humidity sensors. Optical and Quantum Electronics. 56. 10.1007/s11082-023-05907-1.
Gowda, Shruthi & Reddy, Sangeetha. (2020). Design And Implementation Of Crop Yield Prediction Model In Agriculture. International Journal of Scientific & Technology Research. VOLUME 8,. 544.
Maqsood, Junaid & Farooque, Aitazaz & Abbas, Farhat & Esau, Travis & Wang, Xiuquan (Xander) & Acharya, Bishnu & Afzaal, Hassan. (2022). Application of Artificial Neural Networks to Project Reference Evapotranspiration Under Climate Change Scenarios. Water Resources Management. 36. 1-17. 10.1007/s11269-021-02997-y.
Afzaal, Hassan & Farooque, Aitazaz & Abbas, Farhat & Acharya, Bishnu & Esau, Travis. (2020). Computation of Evapotranspiration with Artificial Intelligence for Precision Water Resource Management. Applied Sciences. 10. 10.3390/app10051621.
Thomas van Klompenburg, Ayalew Kassahun, Cagatay Catal, Crop yield prediction using machine learning: A systematic literature review, Computers and Electronics in Agriculture, Volume 177, 2020, 105709, ISSN 0168-1699, https://doi.org/10.1016/j.compag.2020.105709.
Mohanadevi M et al (2018) "A Study on Various Data Mining Techniques for Agriculture Crop Yield Prediction", International Journal of Emerging Technologies and Innovative Research (www.jetir.org), ISSN:2349-5162, Vol.5, Issue 9, page no.797-802, September-2018, Available: http://www.jetir.org/papers/JETIR1809626.pdf
C, Mithra & Suhasini, A. (2023). Fertilizer type and quantity recommendation to increase oilseed crops yield prediction with inorganic fertilizers using machine learning algorithms. Ecology, Environment and Conservation. 29. 358-372. 10.53550/EEC.2023.v29i02s.059.
Shah, Sayed. (2021). Machine Learning based Crop Recommendation System for Local Farmers of Pakistan. Revista Gestão Inovação e Tecnologias. 5735-5746. 10.47059/revistageintec.v11i4.2613.
Katuru, K. & Surapaneni, Ravi & Dasari, Suresh. (2020). Predicting Crop yield and Effective use of Fertilizers using Machine Learning Techniques. International Journal of Innovative Technology and Exploring Engineering. 9. 1288-1292. 10.35940/ijitee.G5911.059720.