Yield modeling of snap bean based on hyperspectral sensing: a greenhouse study

Abstract. Farmers and growers typically use approaches based on the crop environment and local meteorology, many of which are labor-intensive, to predict crop yield. These approaches have found broad acceptance but lack real-time and physiological feedback for near-daily management purposes. This is true for broad-acre crops, such as snap bean, which is valued at hundreds of millions of dollars in the annual agricultural market. We aim to investigate the relationships between snap bean yield and plant spectral and biophysical information, collected using a hyperspectral spectroradiometer (400 to 2500 nm). The experiment focused on 48 single snap bean plants (cv. Huntington) in a controlled greenhouse environment during the growth period (69 days). We used applicable accuracy and precision metrics from partial least squares regression and cross-validation methods to evaluate the predictive ability of two harvest stages, namely an early-harvest and late-harvest stage, against our yield indicator (bean pod weight). Four different spectral data sets were used to investigate whether such oversampled, hyperspectral data sets could accurately and precisely model observed variability in yield, in terms of the coefficient of determination (R2) and root-mean-square error (RMSE). The objective of our approach hinges on the philosophy that selected spectral bands from this study, i.e., those that best explain yield variability, can be downsampled from a hyperspectral system for use in a more cost-effective, operational multispectral sensor. Our results suggested the optimal period for spectral evaluation of snap bean yield is 20 to 25 or 32 days prior to harvest for the early- and late-harvest stages, respectively, with the best model performing at a low RMSE (3.02  g plant  −  1) and a high coefficient of determination (R2  =  0.72). An unmanned aerial systems-mounted, affordable, and wavelength-programmable multispectral imager, with bands corresponding to those identified, could provide a near real-time and reliable yield estimate prior to harvest.


Introduction
It is forecasted that the global population growth will approach ∼24% by the year 2050. 1,2 This notion can profoundly alter social structures, from a society's economic growth, to public health and our interaction with the environment; changes that could diminish employment opportunities, cause agricultural decline, and result in a continuous need for more productive and efficient farms. 3,4 As a result, our ability to use the available croplands, while optimizing yield, will become a requirement for sustainable use of scarce agricultural resources. There exists a range of agricultural crops, but our focus is on snap bean, one of the vegetables that hold a large market *Address all correspondence to Amirhossein Hassanzadeh, E-mail: ah7557@rit.edu share in being the fifth largest vegetable crop nationally in terms of acreage, with 158,920 acres harvested for processing and 71,170 acres harvested for fresh market across the US in 2014, with a combined value of $416 million. 5 Precision agriculture, or site-specific management, is one strategy to help satisfy this need by helping farmers to improve per-acre yield, reduce uncertainty and risk, and optimize the input-output crop yield continuum. 6,7 It is in the context of precision agriculture that remote sensing, a technology that collects spectral and structural information of an object remotely, has garnered attention over the past two decades. 8 One classification of remote sensing systems revolves around the platform, e.g., ground-based, airborne, or spaceborne, with the shared goal of nondestructively obtaining information. 9 This information can be acquired from sensing systems that generate color (RGB) images, three-dimensional (3-D) point clouds from ranging systems [e.g., light detection and ranging (LiDAR)], and spectroradiometers, which collect hyperspectral data, defined as narrow, contiguous spectral bands or wavelengths. 10 The core science behind "hyperspectral" remote sensing and analysis is spectroscopy, an approach to evaluate the interaction of light with materials, introduced by Norris (1965), and the corresponding effect on the material's molecular and chemical structure, called chemometrics. 11 In short, three scenarios can occur when electromagnetic energy strikes a surface: energy can be absorbed, reflected, or transmitted by the surface. 12 Every material thus has a specific reflectivity (ρ), transmissivity (τ), and absorptivity (σ), for which τ þ σ þ ρ ¼ 1 is valid. While each of these scenarios has been studied intensively, the reflected light per wavelength is the most convenient form to examine remotely, given that property's ease of acquisition, inherent informational properties, and the ubiquitous nature of such reflectance-based systems. 13 Most every material is defined by a specific reflectance spectrum, which theoretically varies as the material is altered. This notion becomes crucially important when it comes to living organisms, e.g., crops, which can exhibit spectral variability due to species, physiological state, disease, water stress, growth stage, etc. For example, absorption in the visible region (400 to 700 nm), red edge peak (680 to 760 nm), and reflectance in the near-infrared (NIR) domain (700 to 1000 nm) are primary indicators of chlorophyll absorption, chlorophyll density, intercellular leaf structure, and even the general health of a crop, and serve as a gauge for how welldeveloped, vigorous, and even dense the plant material is. [14][15][16] It is in this context that hyperspectral imaging systems and point-based spectroradiometers have proven valuable, given their ability to measure high spectral resolution from plant surfaces. Spectroradiometers, either imaging or nonimaging (i.e., point-based), represent one class of these hyperspectral sensing systems. Hyperspectral imagers typically collect hundreds of spectral bands, with associated spatial context information (pixel coordinates), often by moving a spectral slit across the subject of interest. 17 Spectroradiometers (nonimaging), on the other hand, obtain an average spectrum within the (fore-)optic's field-of-view (FOV). Hyperspectral devices have been intensively used for a variety of crop-related studies, e.g., harvest maturity quantification, [18][19][20][21][22][23][24][25][26][27] stress and disease detection, [28][29][30][31][32][33][34][35] genus/species classification, [35][36][37][38][39][40][41] and even yield prediction. [42][43][44][45][46][47][48][49][50] We will focus our discussion on the latter, namely yield forecasting or modeling, given its importance to either enable within-season intervention or harvest planning and logistics.
The utility of remote sensing data (e.g., vegetative indices; Ref. 51) to predict agronomic crop data (e.g., yield) is often quantified using empirical univariate or multivariate modeling. For example, Lai et al. 52 evaluated wheat yield as a linear function of time-integrated normalized difference vegetation index (iNDVI) in 17 farms over 15 years in Australia, using leaveone-out cross-validation. Their model was able to predict yield with a low-average RMSE (0.79 Mg ha −1 ). Becker-Reshef et al., 53 in turn, assessed the ability of an empirical model which could estimate yield with a 7% error in winter wheat crop, using MODIS-derived NDVI data from Kansas. Their model was able to forecast yield 6 weeks prior to harvest within a 10% error using MODIS NDVI data in Ukraine. The advantage of empirical modeling is that only a limited number of parameters are required, 54 while pitfalls to this approach include: (i) performance is constrained to the size of the data set; 55 (ii) it is assumed that photosynthetic activity is the primary estimator of the yield, which could result in inaccurate estimations; 53 (iii) these NDVI-based models result in erroneous estimates for crops with high biomass, given that NDVI saturates with high leaf area index; 56 and (iv) this approach is an estimator and arguably not a true forecasting technique. 57 Statistical approaches, as another technique for yield prediction, are defined here as being more multivariate in nature, and we will focus on partial least squares regression (PLSR) 58 as one of the more proven modeling techniques. There has been extensive research encompassing the use of PLSR for crop yield assessment. For instance, the study by Wenzhi et al. 59 evaluated sunflower yield estimation using both PLSR and neural networks on 16 independent physical variables. The authors utilized the variable-in-projection approach for PLSR, which shows how sensitive each variable is concerning the output variable. 60 It was shown that neural networks can capture more complex relationships and slightly outperform PLSR with a high coefficient of determination (R 2 ¼ 0.86, compared to R 2 ¼ 0.77). A study of Vásquez et al. 61 used hyperspectral data (handheld spectroradiometer: 350 to 1050 nm) for paddy field yield estimation via a PLSR approach. The study was cross-validated over six different cultivars, for three different growth stages, and replicated three times. Their results showed that the booting stage exhibits a more accurate estimation than other tests (R 2 ¼ 0.87). It was shown that PLSR can handle data non-normality, with drawbacks such as neglecting variables that may be considered discriminating, as well as being susceptible to the scale of data. We opted to evaluate a PLSR approach to yield forecasting in snap bean, our proxy crop, and did so in a controlled greenhouse environment, before attempting to scale our results to the field-level using unmanned aerial systems (UAS)-based imaging. We hypothesized that snap bean yield can be accurately predicted early in the season via remote sensing techniques, because crop yield has been shown to be a function of time, spectral, and biophysical attributes, among others. 50,[62][63][64][65] The objectives of this study were to: (i) determine the applicability of PLSR for yield assessment of snap bean; (ii) assess the most accurate time to perform snap bean yield estimation, using spectral and plant biophysical data; and (iii) identify spectral regions or features and plant biophysical attributes that result in the most accurate/precise yield predictions for snap bean.

Study Area
Hyperspectral data were collected in an on-campus greenhouse, located at Rochester Institute of Technology (43.0846°N, 77.6743°W), during the winter and spring of 2018/2019. One hundred forty snap bean seeds (cv. Huntington) were planted on March 6, 2019, in a seed raising mix ("Potting Mix," Miracle-Gro, composed of peat, peat moss, perlite, compost, and 0.48% fertilizer), after which 48 plants were selected and transplanted to 6-in.-diameter pots. Plants were irrigated with 500 ml of water every 2 days thereafter.

Plant Growth Characteristics
The 48 plants were divided into two groups of 24 plants, where each plant was considered a single experimental unit. Harvest dates were determined based on when 50% of plants for early harvest, and 70% of plants for late harvest, showed 50% of their pods being at industry sieve size 4 to 6 (see Table 1). On each harvest date, total pod weight of each plant was measured using a balance ("1500" model, VWR, Radnor, Pennsylvania) with a precision of 0.05 g. Sample yield values ranged 16.65 to 38.35 g and 20.20 to 40.10 g for early-harvest and late-harvest stages, respectively.

Data Collection
The spectral data set was captured using a spectroradiometer ("HR-1024i" model, Spectra Vista Corporation, Poughkeepsie, New York), with a 14-deg FOV fore-optic. The spectroradiometer captures spectral information across the visible, NIR, and shortwave-infrared (SWIR) wavelengths (335 to 2500 nm), for 987 contiguous spectral channels. The fore-optic enabled us to accurately target each plant and resulted in a ∼6-cm-diameter sampling size from a distance of 30 cm. This spectroradiometer consists of silicon (350 to 1000 nm), indium gallium arsenide (InGaAs; 1000 to 1890 nm), and extended InGaAs (1890 to 2500 nm) detectors. Select wavelength ranges were omitted from the sampled spectra prior to analysis: (i) range 335 to 480 nm, due to sensor fall-off noise; (ii) the 850-to 1000-nm range, because of artifacts due to detector overlap between the Si and InGaAs sampled regions; (iii) the 1900-to 2000-nm range, due to a similar overlap artifact between the InGaAs and extended InGaAs detectors; and (iv) the region between 2400 and 2500 nm, again due to low SNR (detector fall-off) in the far SWIR region. This selective data cleaning resulted in 678 remaining spectral bands. The spectroradiometer was mounted on a stationary stand and was covered with black felt to inhibit external light from entering the sampling chamber. Two tungsten-halogen lamps were installed inside the chamber to produce stable and consistent light for measurements, shown in Fig. 1. All samples were calibrated to reflectance using a Spectralon calibration panel as the reference target, shown in Fig. 1; this was done to account for potential illumination differences or sensor drift between samples, either within-or across sample days. 66 Finally, to minimize the influence of potential "mixed pixels," i.e., where sample background (potting mix) could influence the signal collected for each target, a collar-shaped nonreflective black felt was used to cover the potting mix (see Fig. 1); this approach enabled the capture of a pure vegetative spectrum for each sample (one spectral sample per plant) at the canopy-level, 15 to 20 cm away from the canopy top, resulting in a 3-to 4-cm sampling diameter. Averaging of canopy-level reflectance resulted in a closer approximation of what a UAS-based spectrometer would sample, even if across canopy scale.
Spectral and biophysical measurements, such as the number of leaves, plant height, and canopy width, were collected three times a week and then daily for the last 10 days of the growth Fig. 1 (a) Black felt was used to reduce spectral mixture effects, where potential potting mix background may erroneously be included on the plant spectrum; and (b) Spectralon calibration panel for reference measurements, used for calibrating radiance data to reflectance and thus normalizing for any illumination variability that may have occurred between samples and sample days. season, throughout the 69-day growing period. The number of leaves per plant was recorded, and plant height and canopy width were measured using a ruler. We normalized the collected spectral and biophysical attributes to the scale of 0 to 1, ensuring comparable feature influence on the final analysis.

Data Analysis
The spectral data were subjected to several preprocessing steps before being used in yield prediction algorithms. First, the data set was tested for multivariate normality using Small's and Chi-squared approaches to evaluate the data distribution. 67 Both approaches utilize the squared Mahalanobis distance between each sample vector and the mean feature vector for the data set X M×N (M number of samples and N number of features) as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 1 1 6 ; 5 9 1 where i ¼ 1; : : : ; m, and S −1 is the inverse covariance matrix of random variable X, along the feature axis. The difference between these two approaches is that the Chi-squared method compares the squared Mahalanobis distance to percentiles from the Chi-squared distribution, which is limited by a requirement of large sample sizes for a large number of features, while Small's method estimates this by calculating u values, and compares sorted u values with percentiles from a beta distribution, as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 1 1 6 ; 4 8 6 The next preprocessing step involved spectral or mathematical enhancement (transformation) of spectral data, including smoothing, first derivative calculations, and absorption feature enhancement. Spectral smoothing was performed using a Savitzky-Golay filter with a window size of 11 bands, fitted with a second-order polynomial. 68 This configuration was chosen based on a preliminary sensitivity analysis using the raw data. 3 First derivatives were calculated, via a Savitzky-Golay filter with the second-order polynomial fit over 11 bands, in order to evaluate not only spectral magnitude but also curve shape, as well as the baseline offset reduction due to scattering. 25 Absorption feature enhancement used a continuum-removal (CR) approach for spectral regions 555 to 755 nm (potential chlorophyll absorption), 1120 to 1265 nm (potential water and cellulose absorption), 1275 to 1675 nm (water, sugar, protein, and nitrogen absorption), 1676 to 1825 nm (lignin, cellulose, and sugar absorption), and 1900 to 2397 nm (structural signatures related to cellulose and lignin). 69 Consequently, these spectral data sets (viz., raw spectra, smoothed spectra, first derivative spectra, and CR-spectra) were appended with concurrent physical data and used as inputs to the yield prediction analysis.
Normalization using scaling ensures that features with different units of measure have the influence on the analysis. 70 We utilized this approach due to the difference in scales between biophysical and spectral data and scaled all features to the 0 to 1 range.
Finally, the PLSR approach was utilized for yield prediction. 58 PLSR projects the data onto an alternative axis/dimension using latent variables (LV), maximizing the covariance between independent variables and the dependent variable. In PLSR, X (the independent variables) and Y (the dependent variable) can be explained as a sum of LVs as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 6 ; 1 8 8 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 1 1 6 ; 1 4 1 where P and Q are the loading factor matrices, T and U are called the score matrices, and E and F are the residual matrices. PLSR also outputs a coefficient matrix, C, that explains the weight of each independent variable in the simple linear regression model, as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 5 ; 1 1 6 ; 7 3 5 Finally, the band selection stage for this study can be defined as the weight of each band (matrix C), which can be used to identify features that have more discriminating power over other features. The goal is to be able to identify features that contribute most to the modeling effort and best explain yield as a function of several independent feature variables. 32,71 For each data set (viz., raw, continuum-removed, smoothed, and first derivative), we evaluated the PLSR-based coefficient of determination (R 2 ), adjusted coefficient of determination (R 2 adj ), and RMSE of calibration and leave-one-out cross-validated sets, 72 based on the number of components used (i.e., maximum of five components). Consequently, regression results of the number of components corresponding to highest R 2 values for calibration and leave-one-out cross-validation sets were reported. Python 3.5 73 was used for all preprocessing and regression tasks. Scikit-learn module 74 was used for PLSR, and author's publicly available codes were used for normality assessment and CR (available at https://github.com/yxoos/MultivariateNormality and https://github.com/yxoos/ ContinuumRemoval).

Results
The weighted average growth rate of snap bean plants ranged from −0.38 to 1.15 cm and −0.40 to 0.65 cm per day of data collection, for early-and late-harvest stages, respectively. Figure 2 shows snap bean biophysical changes including canopy width, plant height, and the number of leaves, over the growth cycle. As can be seen from the two stages [Figs. 2(a) and 2(b)], the variability in canopy width is substantial, while plant height and number of leaves follow a parabolic trend. Specifically, width measurements exhibited more noise than the other two attributes; this might be due to plants' response to the amount of sunlight they receive and available natural light variability throughout the growth period. Maximum plant height was observed between 45 and 55 days after planting (DAP), or between 15 and 25 days prior to harvest (DPH) in the number of leaves curve, for both stages. After that period, the snap bean plants divert more resources to pod formation and maturation, which results in leaf abscission and chlorosis of the foliage [see Figs. 2(a) and 2(b) for number of leaves curve at DAP > 60].
Results from multivariate normality tests, presented in Fig. 3, show p-values that confirm multivariate normality with only minor indications of non-normality. Figure 4 shows the R 2 results as a function of time, while Table 2 tabulates results (R 2 , R 2 adj , and RMSE) of the best model for each transformation approach. Each subfigure shows the best-fit value attained with five LV and the maximum R 2 for the number of LV smaller than five. We can see that the fit values comparing the trends from fit values with those of cross-validation. That is why cross-validation plays a crucial role in proposing a generalized model, i.e., it enables a more accurate prediction response when extrapolated to new datasets. 52 According to Fig. 4, almost all models exhibit local maxima around 33 to 35, 42 to 44, 55 to 57 DAP, and a few around 62 DAP. We can observe a peak around 33 to 35 DAP, which has a lower accuracy when compared to other time frames and only shows superior performance in [Raw-spec. + PHY] and [Smoothed-spec. + PHY], while not being consistent across all transformations (see Table 2). Results from this table also show that raw and smoothed models performed similarly, which are indicative of the relatively low noise level in the acquired data. Moreover, the maximum cross-validation R 2 value at 55 to 57 DAP is considered to be for a period when it is too late to take action, management-wise. The same is valid for 62 DAP, which is at 3 DPH. The maximum cross-validation R 2 value around 42 to 44 DAP (or 23 to 21 DPH) seems to be an excellent temporal assessment period, which arguably is timely in terms of potential management interventions, while also being consistent across transformations in terms of performance. We also observed that the addition of physical attributes increased R 2 (up to 20%) across the growth period, along with increasing R 2 around 62 DAP (or 3 DPH). The model that performed best during that period is [smoothed-spec. + PHY]. However, both CR models exhibited superior performance (R 2 ¼ 0.40) and remained as accurate when biophysical predictor features were removed (R 2 ¼ 0.31 for [CR-spec.], R 2 ¼ 0.33 for [CR-spec. + PHY]). Figure 5 shows the residual and regression plots for the two models that performed best, i.e., [Raw-spec. + PHY] and [CR-spec.], as reported in Table 2. As presented in Table 2, the residual distribution was random for both approaches and did not follow any trend, which confirms that the dependent-independent variable modeling approach properly described the trend or behavior in the variable relationship. Two wavelength regions (1030 to 1075 nm NIR and 1678 to 1725 nm SWIR) were dominant for yield prediction.

Late-Harvest Stage
Results for all four spectral data sets are shown in Fig. 6; this figure shows that several local maxima occur around 37 DAP and 44 to 47 DAP. Correspondingly, the highest accuracy measures are presented in Table 3. It is likely that the 37 DAP exhibits a maximum due to biophysical features, since the performance would deteriorate were it not for these biophysical characteristics. It is also interesting to note that, while significant variability in observed yield was explained at 37 DAP, the R 2 value decreases to 60% of its original value if biophysical data are omitted. However, the [CR-spec.] data do not rely on biophysical data at all and exhibits similar performance to the [Smoothed-spec. + PHY] model, which indicates that this transformation could be more robust.
Multiple wavelength ranges contributed to this analysis in a more general sense, as presented in Table 3. These ranges were 483 to 495 nm, 670 to 680 nm, 748 to 752 nm, 2090 to 2110 nm, Fig. 3 (a) Small's multivariate normality test. Note that the data follow the non-normal trend to some extent; and (b) the Chi-squared multivariate normality test. Fig. 4 The early-harvest stage results: (a)-(d) spectral-only results for the raw data, smoothed, first derivative, and CR approaches, (e)-(h) the stage outcomes using for spectral plus biophysical features for the raw data, smoothed, first derivative, and continuum-removed data sets. 2273 to 2287 nm, and 2380 to 2395 nm. In the next section, the correlations between these spectral features and plant chemical and physiological changes are explained. We depicted the top two models in Fig. 7; we elected to show the results for both the [CRspec.] and [Raw-spec. + PHY] models, since the former is not as reliant on biophysical features, and the latter outperformed the other models. As can be seen, findings show high coefficients of determination for the raw and CR approach (R 2 ¼ 0.67 and R 2 adj ¼ 0.58 for raw data, R 2 ¼ 0.58 and R 2 adj ¼ 0.63 for continuum-removed data). Also, raw data exhibited a low RMSE (0.265 g plant −1 ), 25% lower than the best result in the early-harvest stage, shown in Fig. 7(b). Again, the distribution of residuals has a random nature to them [see Figs. 7(a) and 7(c)], which supports the conclusion that the algorithm or model captured the trend in the dependent variable.

Discussion
Yield prediction of snap bean via hyperspectral remote sensing techniques in a greenhouse environment was evaluated and proved satisfactory in terms of the explained variability in the dependent (yield) variable. Our findings show that 20 to 25 and 32 DPH represent the optimum time periods for accurate snap bean yield prediction, with high cross-validated coefficients of determination (R 2 val ¼ 0.40 to 0.72) and low RMSE (3.02 to 4.51 g plant −1 ). Moreover, our results demonstrate that the snap bean yield prediction can be improved by the addition of biophysical data (viz., canopy width, height, and the number of leaves) to spectral predictor features.   6 The late-harvest stage results: (a)-(d) the spectral only results for raw data, smoothed, first derivative, and CR approaches, (e)-(h) spectral plus biophysical features included in the for raw data, smoothed, first derivative, and CR stages. Note: The bold rows indicate the best models in Fig. 7.
Our yield prediction analysis shows that that continuum-removed data and raw data outperformed other spectral datasets. The early-harvest stage included spectral features in two wavelength regions, namely the 1030-to 1075-nm NIR and 1678-to 1725-nm SWIR regions. A closer look at these spectral domains and their correlations to plant physiology explains that: 1. The 1030-to 1075-nm region is tied to protein and oil absorption, and their corresponding N─H stretch and C─H stretch/deformation. 75 2. The 1678-to 1725-nm region, on the other hand, has been shown by several studies to be coupled to lignin and nitrogen absorption. 16,76,77 Similarly, the late-harvest stage included six wavelength regions of significance to yield prediction, namely 483 to 495 nm (blue), 670 to 680 nm (red), 748 to 752 nm (red edge), and SWIR regions 2090 to 2110 nm, 2273 to 2287 nm, and 2380 to 2395 nm. The corresponding link between the identified spectral domains and plant physiology were scrutinized in the literature: 1. The 483-to 495-nm region suggests strong absorption for chlorophyll-a, and -b. 75 However, this region is also susceptible to noise, due to atmospheric scattering, which results in correspondingly low SNR for many remote sensing sensors, should these results be extended to airborne platforms. 75 2. The 670-to 680-nm range is responsible for chlorophyll-a absorption and its corresponding electron transition. 75 The mentioned rationale makes sense since, at that stage (early pod formation), the plant accumulates more chlorophyll, as required for photosynthesis and to store energy to turn small pods (i.e., pins) into mature ones. Also, we noticed that as the plant transitions through pod maturity, pod formation is mainly dependent on sunlight. In other words, if enough water and sunlight are available to the plant, pods will remain on the plant, hence higher yield, while pod excision will occur if resources become scarce. 78,79 3. The 748-to 752-nm region contains the well-known red edge feature, 80 which is tied to the general chlorophyll level in the plant and its associated impact on plant vigor/ health. 16 4. Even though there is a decrease in the SNR ratio in the third detector (extended InGaAs), the range between 2090 and 2110 nm serves as a reliable indicator of starch, cellulose, and nitrogen absorption features, which vibrates O═H, C─O, and C─O─C bands. 75,76 5. The 2273-to 2287-nm region is mainly governed by the C─H, O─H, and CH 2 stretch for cellulose, sugar, starch, lignin, and nitrogen. 75,77 Note that cellulose and lignin indicate more structural support for pod formation, which mirrors the stage during which the analysis occurred (44 DAP). Sugar and starch absorption are essential factors in pod formation, since almost 72% of snap bean dry weight is composed of these two elements. 81 6. Finally, limited information exists in the literature regarding the 2380-to 2395-nm region; this lack of information could be due to this region often exhibiting low SNR ratios, due to the SWIR detector fall-off and atmospheric absorption effects. 75 This could suggest an area of further research that focuses on the link between potential structural absorption features in the higher end of SWIR region and plant physiology.
The identified spectral domains and their corresponding link to plant physiology, for both early-and late-harvest stages, are tabulated in Table 4. These identified spectral features/ranges can be used to distill the oversampled hyperspectral system to more operational, cost-effective multispectral sensing solutions. For example, typical UAS-based imaging spectrometers range between $60,000 (NIR) and $170,000 (SWIR), involve complex system calibration steps, require extensive in-flight calibration to reflectance, and have to be flown on expensive airframes, given the payload weight. A multispectral solution, on the other hand, which retains only the above-mentioned spectral features, potentially would be much cheaper, require less preprocessing, and can be housed on more accessible UAS platforms. The mentioned downsampled multispectral solution ideally then would have spectral bands that correspond to the spectral features identified in this study. For example, a prediction aimed at the late-harvest stage would include a drone flight at 25 DPH, with the five narrow-band spectral regions mentioned in Table 4. This effectively could reduce the cost of UAS "yield modeling" platform by an order of magnitude, when compared to spectrometers. These results also can be compared to other studies, even though the scales may differ.
For example, crop yield prediction generally falls into two different categories in terms of sensing platform, namely airborne-and satellite-based. Studies focusing on satellite-based crop yield prediction (e.g., Refs. 52, 53, 82 and 83) have proven accurate estimates, but with the drawback of coarse spatial scales, often at no better than the field-level scale. This negates the use of satellite-based assessments for precision agriculture purposes. 84 This is where low-cost, operational airborne crop yield prediction becomes attractive (e.g., Refs. [85][86][87][88][89]. However, airborne-level spectral yield assessment has its own drawbacks, namely that these systems (i) are inherently noisy, (ii) can adversely be affected by changing illumination conditions, (iii) are lowcost in general, but may become computationally and financially expensive when collecting high temporal resolution data, and (iv) mostly include preidentified spectral bands not selected for the specific application. Most of these limitations are addressed in our study, i.e., a best-case scenario which identifies spectral bands, downsampled from a hyperspectral system, in lownoise, stable illumination conditions, and at a high temporal resolution, to identify the ideal wavelengths and most accurate timing for yield prediction. Future efforts could focus on translating these concepts to a multispectral setup (see next section).

Conclusions
Timely and accurate/precise yield assessment of crops can benefit farmers via optimization of management inputs for maximized profit, but also maximize the actual yield for market consumption, and can contribute to doing so on a sustainable basis. Our study focused on modeling pod weight, as a yield indicator of plants, using hyperspectral and biophysical data. Our approach hinged on PLSR for yield prediction of snap bean (cv. Huntington) at two different harvest times, namely early-and late-harvest, defined as when 50% of plants for early harvest and 70% of plants for late harvest, showed 50% of their pods being at industry sieve size 4 to 6.
Our early-harvest and late-harvest stage approaches confirmed that 20 to 25 DPH is when spectral signatures surface that result in a high accuracy for yield forecasting. Biophysical features proved to be critical in improving the performance of models. Also, the depicted growth trends in biophysical features, and their corresponding maxima around the same period (20 to 25 DPH), corroborated our results based on spectral-only data for the ideal time period to apply yield assessments. We demonstrated that, if using spectral-only data, the CR model, and if incorporating biophysical data with spectral information, the raw or smoothed transformations, should be the preprocessing method of choice. Our spectral model was able to explain ∼72% of the variability in observed yield values (R 2 ¼ 0.72), which bodes well for extending these results to other, more operational scenarios. In terms of spectral features, the NIR edge peak, chlorophyll absorption features in the visible domain, protein and oil absorption features in the early-SWIR region (∼1000 nm), lignin and nitrogen absorption peaks in the ∼1700 nm range, and sugar and structural absorption features (e.g., starch, lignin, cellulose, protein, nitrogen, etc.) in the ∼2100 and ∼2280 nm regions proved to be especially critical to robust modeling efforts. Our study represents a best-case scenario, from a spectral perspective, where the illumination conditions were optimal and controlled during data collection. As such the SNR remained relatively high even at the tail-ends of the detectors' spectral response curves, but would deteriorate in more typical, outdoor conditions, especially as more atmosphere (altitude) is introduced. That led to one of our conclusions, i.e., that while these models showed promising results and could be extensible to more operational platforms, the next phases of the research should focus exactly on that, i.e., scaling of results from a greenhouse scenario to UAS or other airborne modalities. Also, results from this study imply that for future work on the use of structural sensing systems, such as LiDAR and 3-D structure-from-motion approaches, could augment spectral-based models by including estimates of plant biophysical attributes. Finally, implementation of crop growth models, or models that aim to enhance statistical model accuracy by adding more auxiliary environmental variables, such as soil characteristics, geographical data, and other crop attributes, could also be an area of interest for future work.
Sean Patrick Murphy is a technician in plant pathology at Cornell AgriTech, Cornell University, Geneva, New York, USA, and focuses on diseases of broad-acre vegetable crops, such as table beets, snap beans, dry beans, and allium crops. He joined Cornell University in 2011 in the Horticulture Department working on peas, beans, and sweet corn before moving to plant pathology in 2017. He received his bachelor of technology degree in horticulture business management from SUNY Morrisville in 2016.
Sarah Jane Pethybridge is an associate professor of plant pathology at Cornell AgriTech, Cornell University, Geneva, New York, USA, and focuses on diseases of broad-acre vegetable crops. She specializes in vegetable disease epidemiology and management. She joined Cornell University in 2014 after roles in academia, industry, and government in Australia and New Zealand. She received her bachelor of agricultural science with first class honors and PhD in plant pathology in 1995 and 2000, respectively.