A machine learning approach to rural entrepreneurship
Abstract
This study offers a novel approach to understand the mechanisms of rural entrepreneurship by applying five alternative machine learning techniques on data obtained from the Life in Transition Survey III. Results highlight how capital constraints, age, factors related to trust and over-trust, awareness of current trends, the use of various media tools, a competitive character, institutional factors, and education are associated with the success and failure of potential entrepreneurs in rural areas who attempt to set up a business. The final predictions are achieved with accuracies ranging from seventy-two to ninety-two percent.
1 INTRODUCTION
“…the creation of a new organization that introduces a new product, serves or creates a new market, or utilizes a new technology in a rural environment” (Wortman, 1990).
The term “rural environment” articulated in the preceding definition deserves emphasis. A rural environment can embody distinct impediments to entrepreneurship; for example, the rift between rural areas and major urban markets imposes accessibility constraints on potential entrepreneurs (Stathopoulou et al., 2004). Nevertheless, in the face of these challenges, rural entrepreneurs are key players in rural revitalization and economic development (Gladwin et al., 1989; Wortman, 1990). Recognition of this fact, together with the aim of fostering economic growth and competitiveness in rural areas, has led to practical policy applications in Europe in recent decades (Baumgartner et al., 2013). Entrepreneurial activities in rural areas can nurture innovation, create job opportunities, and make the best use of the limited resources available (Fuller-Love et al., 2006; Newbery et al., 2017). Baumgartner et al. (2013) place the rural entrepreneur in the lead role in their conceptual model, and argue that entrepreneurship gives rise to positive social, economic, and ecological outcomes in rural areas. However, despite the recognition of the potential of rural entrepreneurship, small and new businesses in rural areas frequently fail (Deller & Conroy, 2016). The high failure rate of rural businesses has persisted since the early 1990s even in developed countries, and the need to understand more about the personal attributes and behavior patterns that lead to successful entrepreneurial outcomes in rural communities has often been brought to attention (Gladwin et al., 1989).
In relation to business hindrances pertaining to rurality, the necessity of designing and improving policies to support entrepreneurs in rural areas has been underlined, particularly in the European context (North & Smallbone, 2006; Stathopoulou et al., 2004). Alongside accessibility constraints, other significant impediments to entrepreneurship are commonly ingrained in rural areas. These disadvantages include, among others, an aging population (Pato & Teixeira, 2014), the absence of sufficient education and health services (Lehmann et al., 2008), being female (Bock, 2004; Pagan, 2002), and being young, as opposed to older entrepreneurs who are likely to have a higher reputation in a rural context (Meccheri & Pelloni, 2006). Furthermore, individuals in rural labor markets are generally characterized by lower education levels and skills (Korsgaard et al., 2015).
In addition to the above examples of challenges that rural environments pose for entrepreneurs, it has also been shown that certain demographic groups face disadvantages specific to rural locations. In rural Europe, unemployment has been observed as an enduring obstacle, in particular for young job-seekers (Philip & Shucksmith, 2003; Unay-Gailhard, 2016). On top of that, it is often observed that the previously mentioned insufficiency of education services in rural areas is particularly detrimental to the employment opportunities of women and young people (Bock, 2004; Cartmel & Furlong, 2000; Chandler, 1989). For the latter group, even finding accommodation and accessing social networks that can potentially result in job placement may be a problem (Hoggart & Cheng, 2006; Lindsay et al., 2003; Rugg & Jones, 1999). Such obstacles are important: in most cases, jobs in rural areas are distributed through word-of-mouth (McQuaid et al., 2004), and informal networks play a significant part in entrepreneurial decisions (Karadeniz & Özçam, 2010). Women face additional disadvantages resulting from social customs particular to rural settings (Maru, 2016), leading to an increase in frequently unpaid workload in agriculture (Olhan, 2011). Young individuals, on the other hand, often respond to rural-specific challenges by simply moving away from these areas and relocating to urban locations (Cartmel & Furlong, 2000; Jones, 2004). Their emigration alters the demographic characteristics of rural areas, as old and vulnerable individuals account for most of those who are left behind (Lyu et al., 2019). Such disadvantages that are particular to rural settings can be mitigated, if not eliminated, by fostering entrepreneurial activities which, in turn, can play an important part in the revitalization and development of rural areas.
On these grounds, the present study aims to shed light on the individual-level mechanisms behind successful and unsuccessful entrepreneurial outcomes in rural areas through the use of modern algorithmic methods that are suitable for processing datasets that contain large numbers of features, namely, tree-based machine learning (ML) models.
The remainder of this study is organized as follows. Section 2 discusses how machine learning approaches are suitable techniques to address research questions on entrepreneurship. Section 3 describes the data and the source survey. Section 4 outlines a base classification tree and discusses its findings. Section 5 extends the analysis to ensemble techniques and elaborates on the findings of a bootstrap aggregated model alongside those of a random forest approach. Section 6 discusses the results of sequential additive tree models; namely, gradient boosting and stochastic gradient boosting machines. Section 7 presents the concluding discussion.
2 THE BENEFITS OF MACHINE LEARNING METHODS IN VIEW OF THE COMPLEXITY OF ENTREPRENEURSHIP
Substantial socioeconomic complexities and non-linearities among relationships in the area of rural entrepreneurship are evident in earlier discussions in the literature. For instance, Stathopoulou et al. (2004) discuss how rurality affects entrepreneurship during the conception, realization, and operation stages. According to the authors, entrepreneurship is affected differently in each of these stages in an interactive and dynamic way, subject to the economic and social environment within which the entrepreneur is located. The authors also highlight that individual attributes related to identifying economic opportunities, taking risks, and adapting to new entrepreneurial objectives are strongly subject to rurality. In other words, the behavior and success of the rural entrepreneur can only be fully understood in relation to the institutional and social context of the rural area in which she/he attempts to operate (Stathopoulou et al., 2004). Such interactive and dynamic mechanisms that give rise to non-linear relationships underlying entrepreneurial outcomes have been emphasized by other researchers as well. For instance, Anderson et al. (2012) conceptualize entrepreneurship as a complex adaptive system, and highlight that examining the components of entrepreneurship in isolation from each other in a reductionist manner will result in failure to understand novel and difficult-to-predict patterns, which in turn leads to the misinterpretation of the causal mechanisms underlying the interconnected wholeness of entrepreneurship. In the entrepreneurship literature, the idea that entrepreneurial activities are “complex social problems,” subject to dynamic properties of non-linear network feedback systems that are very difficult to predict, is firmly established (Bruyat & Julien, 2001; Dorado & Ventresca, 2013).
The complex interdependencies that exist among individual-level attributes in relation to entrepreneurship have also been highlighted by Acs et al. (2008), who discuss how the nexus linking entrepreneurship, institutions, and economic development is a critical and complicated policy interest. Machine learning methods present very useful tools to address precisely such complexities thanks to their ability to consider all available data and all potential interactions and non-linear forms. They are particularly suitable for addressing research questions that deal with exceedingly complicated relationships among variables (Varian, 2014). As discussed above, such complexities are strongly pronounced in relation to socioeconomic relationships (Nijkamp et al., 2001). Focusing on the social context in particular reveals further examples of how social complexities are sources of a large degree of sophistication. For instance, as noted by Korsgaard et al. (2015), even the lifestyle choice of the entrepreneur, and the resulting espousal of a specific location, may play a role in entrepreneurial outcomes. Lifestyle-related entrepreneurial choices, when supported by the presence of IT and other types of infrastructure, may even outweigh (but not offset) profitability concerns. Such personal choices are particularly observable in the context of rural tourism (Getz & Carlsen, 2000; McGehee & Kim, 2004).
In conjunction with the discussion on lifestyle choices, findings on social relations and social capital in the area of entrepreneurship have led to the development of the concept of “embeddedness” (Granovetter, 1985). More specifically, trust and solidarity in social networks and the rural entrepreneur's ability to draw support from them are often noted as important determinants of entrepreneurial activity (Akgün et al., 2010; Portes, 1998).
As mentioned earlier, rural entrepreneurial choices are subject to the availability of infrastructure. However, in a rural context, entrepreneurship-related infrastructure is not limited to roads, internet services, or the like. Neglected or empty buildings like former castles, unused slaughterhouses, and properties and landscapes associated with legends or history (i.e., structures offering “immaterial resources”) are frequently turned to commercial advantage by entrepreneurs (Müller & Korsgaard, 2017). In addition to such advantages, the physical setting itself, that is to say, the nature and the landscape, is often a potential asset for the rural entrepreneur (Muñoz & Kimmitt, 2019). Moreover, rural locations present a wide variety of opportunities to entrepreneurs arising from the demand for amenities, recreation, and quality food products, as well as light industry merchandise (Stathopoulou et al., 2004). Accordingly, the methods applied in this study are based on the expectation that the dynamic and non-linear socioeconomic systems underlying entrepreneurial outcomes are highly sophisticated. Therefore, as an alternative to conventional theoretical modeling decisions, ML approaches that may shed new light on the topic by presenting new avenues and methods of analytical exploration are used. The potential contribution of these new statistical techniques is based on their high performance in out-of-sample prediction based on large data sets, which makes them particularly relevant in cases where a wide range of explanatory variables and highly interactive relationships are present (Athey, 2018; Harding & Hersh, 2018; Mullainathan & Spiess, 2017).
The methods used in the present study are tree-based models as part of modern ML techniques. This is a main feature of the study that contributes to its novelty, as applications of machine learning techniques on the topic of entrepreneurship are very rare. While a small number of ML studies related to the concept of entrepreneurship exist, they often do not address entrepreneurial outcomes directly, but approach the topic mostly in the form of predicting entrepreneurial intentions and orientations of individuals and establishments (see the discussion by Sabahi & Parast, 2020). Two recent comprehensive studies that examine the topic are by Montebruno et al. (2020) and Sabahi and Parast (2020). Montebruno et al. (2020) classify individuals whose entrepreneurial status was unregistered in the British censuses through the period 1851-1881 into the categories of “entrepreneur” and “non-entrepreneur” (or “worker”) based on training on newer census data using a wide range of ML algorithms. The authors demonstrate how ML algorithms outperform traditional classification methods in the context of a research question similar to that of the present study, albeit with a much smaller number of features. This being said, the historical data used by Montebruno et al. (2020) contains information related to occupation/industry, which clearly can improve predictions of entrepreneurial success. While this information is also present in LiTS III (for the respondent and the parents), the large number of missing observations for our restricted sample renders it unusable. (This variable causes 502 observations to be lost, and an important portion of the remaining observations fall under the “Nonclassifiable Establishments” and “Don't know” categories.)
Sabahi and Parast (2020), on the other hand, take a different approach and establish a clear connection between project performance and individual entrepreneurial orientation by considering entrepreneurship as an explanatory feature, among other predictors, in a series of ML algorithms. Clearly, the studies by Montebruno et al. (2020) and Sabahi and Parast (2020) are important illustrations of the suitability of ML methods for understanding the underlying mechanisms of entrepreneurial activity, particularly given the high chance that the outcome is subject to “ambiguous functional forms” (Gu et al., 2018; Sabahi & Parast, 2020). Nevertheless, despite the existence of such important contributions, the scarcity of the application of ML methods is also apparent in the area of economic research as a whole (Athey & Imbens, 2019).
3 THE DATA

For the purpose of reducing unobserved heterogeneity that may be present as a result of large differences in institutional systems, the data set has been limited to the EU countries that are covered in LiTS III. As a result, the remaining countries in the data set are: Bulgaria, Croatia, Cyprus, Czechia, Estonia, Germany, Greece, Hungary, Italy, Latvia, Lithuania, Poland, Romania, Slovakia, and Slovenia.
The above subsetting steps reduced the original LiTS III data set to a sub-sample consisting of 9,389 observations and 1,288 features. However, this reduced data set included a large number of variables with very few observations, so that no complete cases remained. A sequential (step-wise) approach was taken to identify the features, and the combinations of multiple features, responsible for rendering the data set unusable. At each iteration, the feature or feature combination causing the highest loss of complete cases was dropped. The LiTS III data also includes a large number of administrative variables (e.g., duration of the interview, interview time, etc.), in addition to rows that correspond to refusals to answer. These variables and observations were also excluded. Finally, each class in every categorical feature with a large number of classes was coded as a binary variable. The resulting final sample used in all ML models consists of observations on 642 potential entrepreneurs and 233 features.
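The step-wise cleaning described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the `min_rows` stopping rule, and the use of `None` for a missing answer are assumptions, and the sketch drops single features only, whereas the text also considers feature combinations.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]  # None marks a missing answer (assumption)


def complete_cases(rows: List[Row], features: List[str]) -> int:
    """Count rows with no missing value among the given features."""
    return sum(all(r[f] is not None for f in features) for r in rows)


def greedy_drop(rows: List[Row], features: List[str], min_rows: int) -> List[str]:
    """Drop, one at a time, the feature whose removal recovers the most
    complete cases, until at least `min_rows` complete cases remain.
    `min_rows` is a hypothetical stopping rule chosen for illustration."""
    kept = list(features)
    while len(kept) > 1 and complete_cases(rows, kept) < min_rows:
        # Complete cases that would remain after removing each candidate.
        gains = {f: complete_cases(rows, [g for g in kept if g != f]) for f in kept}
        most_costly = max(kept, key=lambda f: gains[f])
        kept.remove(most_costly)
    return kept


def one_hot(rows: List[Row], feature: str) -> List[Row]:
    """Recode one categorical feature as a set of 0/1 indicator columns."""
    levels = sorted({r[feature] for r in rows if r[feature] is not None})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != feature}
        for level in levels:
            # A missing answer stays missing in every indicator column.
            new[f"{feature}_{level}"] = (
                None if r[feature] is None else float(r[feature] == level)
            )
        encoded.append(new)
    return encoded
```

Calling `greedy_drop(rows, features, min_rows)` returns the retained feature names, after which `one_hot` can be applied to each remaining categorical feature with many classes.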
Due to the large number of features, only those that are selected by the ML models are defined within this study. The algorithmic feature selections are made from 233 features that define personal demographic traits such as age, gender, and ethnicity, alongside variables related to starting a business, borrowing funds, taking risks, trust in various groups and institutions, the mode of obtaining news and information, membership in groups and unions, political activity, views on corruption, etc. The full LiTS III data and the code book on all variables are available for download on the website of the EBRD.
4 RECURSIVE BINARY PARTITIONING
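Node impurity is measured by the Gini index $G_m = \sum_{c} \hat{p}_{mc}\,(1 - \hat{p}_{mc})$, and $\hat{p}_{mc}$ is equal to $\frac{1}{N_m}\sum_{i \in R_m} \mathbb{1}(y_i = c)$, where $y_i$ is the outcome for the i'th entrepreneur in the training data ($i = 1, \dots, N$), $c$ indexes the classes, and $R_m$ is the set of observations that fall into the m'th node with $N_m$ being the corresponding number of observations. As there are two classes of the dependent variable, $G_m$ is simply $2\,\hat{p}_{m1}(1 - \hat{p}_{m1})$. The partitioning process of the data into distinct regions by a classification tree and the correspondence of those regions to tree leaves is further discussed and visualized in the Appendix. A small value of $G_m$ means that the node $m$ is dominated by observations belonging to a single class while $G_m = 0$ indicates that all observations in $m$ are of the same class (Géron, 2019; James et al., 2013). At each stage of the binary partitioning process, a splitting feature $x_k$ from the predictor space with $k = 1, \dots, K$ and its splitting value $s$ are selected such that they solve the standard CART split problem:

$$\min_{k,\,s}\left[\frac{N_{L}}{N_m}\,G_{L} + \frac{N_{R}}{N_m}\,G_{R}\right] \tag{1}$$

where $L$ and $R$ denote the two child nodes obtained by splitting node $m$ at $x_k \le s$ and $x_k > s$, with $N_L$, $N_R$ and $G_L$, $G_R$ being their numbers of observations and Gini indices.

To prevent overfitting, the tree is pruned by searching, through a 10-fold cross-validation procedure, for the optimum value for the complexity parameter $\alpha$ that yields, on average, the minimum value for $\sum_{m=1}^{|t|} N_m E_m + \alpha\,|t|$, where $E_m = \frac{1}{N_m}\sum_{i \in R_m} \mathbb{1}(y_i \neq c^{*})$, $m$ is the index for the terminal nodes of the corresponding subtree $t$ ($t \subseteq T_0$ and $m = 1, \dots, |t|$, with $T_0$ the unpruned tree and $|t|$ the number of terminal nodes of $t$), $R_m$ and $N_m$ are the set and number of observations within $m$ respectively, and the majority class in the terminal node is denoted by $c^{*}$ (Friedman, 2001; James et al., 2013; Sutton, 2005). Furthermore, a grid search is used for identifying the minimum number of observations in a node, and the maximum node depth.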
Both parameters were searched over the range 2 to 20 in increments of 1. These depth and observation-number parameters, together with the complexity parameter, aim to avoid overfitting by allowing for predictions generalizable to out-of-sample cases. If an unrestricted (i.e., unpruned) tree is constructed, the tree and its splits would lead to overly precise predictions exclusive to the training data set. However, rebuilding the tree based on a different group of individuals would likely lead to very different predictions regarding the success outcomes of out-of-sample potential entrepreneurs (a tree built on the former data would perform poorly in predicting the latter). Pruning the tree through these parameters diminishes this problem but does not eliminate it, hence the usage of ensemble models in the following sections.
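To make the roles of the impurity measure and of the depth and node-size restrictions concrete, the following is a minimal sketch of recursive binary partitioning with Gini impurity. It is illustrative only and not the implementation used in this study: the function names and the `max_depth` and `min_split` arguments are assumptions, only numeric features are handled, and cost-complexity pruning is omitted.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity: sum over classes of p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def best_split(X: List[List[float]], y: List[int]) -> Optional[Tuple[int, float, float]]:
    """Return (feature index k, threshold s, weighted impurity) of the best
    binary split, or None when no feature takes two distinct values."""
    best = None
    n = len(y)
    for k in range(len(X[0])):
        values = sorted({row[k] for row in X})
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [yi for row, yi in zip(X, y) if row[k] <= s]
            right = [yi for row, yi in zip(X, y) if row[k] > s]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[2]:
                best = (k, s, score)
    return best


def grow(X, y, depth=0, max_depth=5, min_split=6):
    """Grow a tree; a node becomes a leaf when it is pure, too small to
    split, or already at the maximum depth."""
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(y) < min_split or gini(y) == 0.0:
        return majority
    split = best_split(X, y)
    if split is None:
        return majority
    k, s, _ = split
    left = [(r, yi) for r, yi in zip(X, y) if r[k] <= s]
    right = [(r, yi) for r, yi in zip(X, y) if r[k] > s]
    return (
        k,
        s,
        grow([r for r, _ in left], [yi for _, yi in left], depth + 1, max_depth, min_split),
        grow([r for r, _ in right], [yi for _, yi in right], depth + 1, max_depth, min_split),
    )


def predict(tree, row):
    """Follow the splits down to a leaf and return its class label."""
    while isinstance(tree, tuple):
        k, s, left, right = tree
        tree = left if row[k] <= s else right
    return tree
```

Raising `min_split` or lowering `max_depth` stops the recursion earlier and yields a coarser tree, which is the mechanism by which these restrictions limit fitting to the particularities of the training sample.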
Based on the steps outlined above, a classification tree where $\alpha = 0.01$, with a minimum split value of 6 and a maximum depth of 5, is generated. The resulting tree, presented in Figure 1, has an accuracy of 72% on the test data. The features selected by the tree are listed in Table 1, which also shows the class breakdown and distribution for the categorical variables. The descriptive statistics are shown in Table 2. For reference, the diagram of the unpruned version of this tree is presented and discussed in Appendix A.2.
Each leaf (terminal region) in the figure lists the predicted class for that region, as well as the distribution of the YES and NO categories respectively, and the percentage of observations that fall into that leaf. The resulting classification tree, as discussed below, clearly hints at a highly non-linear relationship between the predictors and success in setting up a business in a rural context. Unlike, say, an econometric approach which would attempt a one-shot estimation using the full data, the ML models allow for all functional forms and search for the best parameters through resampling the data using cross-validation and/or bootstrapping techniques. While the former approach would require subjective modelling and selection from the 233 features, the ML models follow strictly algorithmic steps in uncovering a potentially ambiguous functional form, while still being grounded in theoretical underpinnings (e.g., the specification of dependent versus explanatory variables based on the research question, and the selection of a theoretically relevant data set).

| Name | Description | Values |
|---|---|---|
| Czechia | Binary variable indicating the country of the respondent as Czechia. | Equals one if the respondent is based in the country (**163**) and zero otherwise (**479**). |
| Demonstration | Score measuring how likely the respondent is to attend a lawful demonstration. | Categorical variable where 1 indicates the respondent has already attended at least one (**123**), 2 indicates the respondent may attend (**277**), and 3 indicates that the respondent would never attend (**242**). |
| OnlineNews | A 1 to 7 scale measuring the frequency of the usage of the internet and email for following the global and countrywide developments. | 1 indicates “never” and 7 indicates “daily.” |
| RiskTaker | Willingness of the individual to take risks. | A 1 to 10 scale where 1 indicates “not willing at all” and 10 indicates “very much willing.” |
| SuccessLoan | Variable categorizing successful and unsuccessful attempts to borrow funds for starting a business. | 1: Succeeded (**149**), 2: Did not succeed (**55**), 3: Did not try to borrow (**438**). |
| TrustBanks | The degree of trust the respondent has in banks and the financial system (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| TrustParl | The degree of trust the respondent has in the Parliament (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| TrustPresid | The degree of trust the respondent has in the presidency (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| WhyNotLoan | Variable categorizing the reason why the entrepreneur did not borrow any money to set up the business. Originally does not apply if SuccessLoan=1; therefore, category 9 is created to internalize this group. | 1: A loan was not needed/capital was sufficient (**271**), 2: Complex application procedures (**18**), 3: Unfavorable interest rates (**35**), 4: High collateral requirements (**14**), 5: Unfavorable maturity and size of loan (**3**), 6: Necessity of informal payments to banks (**1**), 7: Anticipation of refusal (**24**), 8: Other reason (**66**), 9: Did borrow (**210**). |
- Note: Some feature definitions in this table may be partly similar or identical to those listed in the source LiTS III code book (EBRD, 2016).
- The observations falling into each category are shown in bold and in parentheses within the corresponding variable definitions.
| Variable | Median | Mode | Min | Max |
|---|---|---|---|---|
| Czechia | | 0 | 0 | 1 |
| Demonstration | | 2 | 1 | 3 |
| OnlineNews | 6 | | 1 | 7 |
| RiskTaker | 5 | | 1 | 10 |
| SuccessLoan | | 3 | 1 | 3 |
| TrustBanks | 3 | | 1 | 5 |
| TrustParl | 2 | | 1 | 5 |
| TrustPresid | 3 | | 1 | 5 |
| WhyNotLoan | | 1 | 1 | 9 |
- Number of observations = 642 for all variables.
- Note: The central tendency measure is reported as the median for ordinal variables, and the mode for categorical variables.
The top split in Figure 1 is related to the borrowing of funds: the entrepreneurs who either did not try to take out loans or succeeded in borrowing funds are separated from the individuals who were not able to borrow (SuccessLoan). Focusing on the former group, the next split is made by the categorical feature indicating the reason for not taking out (or failing to take out) a loan (WhyNotLoan), further highlighting the role of capital constraints. More specifically, the entrepreneurs who were able to take out a loan (joining in from the previous split), those who did not need to borrow, and those who dismissed the idea due to certain conditions of repayment are separated from the individuals who expected to be refused or were deterred by complex application procedures, high interest rate levels, or collateral conditions. (To prevent the entrepreneurs who did borrow from being dropped from the analysis, this group was included by adding category 9, defined in Table 1, to the variable, even though the question asked by WhyNotLoan is not applicable to them.) Tracing the former category down the tree, a country-specific split variable appears: if the individual is located in Czechia, she/he is predicted to fail even in the presence of a high level of trust in banks. Individuals in Czechia who have lower trust in banks (TrustBanks) but often use online resources to follow the news (OnlineNews) are predicted to succeed, as opposed to those who use online resources less frequently. Entrepreneurs based in other countries, on the other hand, are generally predicted to succeed, aside from a group of individuals with a presumably excessively high level of risk-taking behavior (RiskTaker).
The particular case of Czechia observed in the single-tree context is also highlighted by the subsequent ensemble ML results in the present study, and the findings suggest the necessity of a case-specific approach to this country. Nevertheless, it is possible to observe some clues in the literature concerning the challenges pertaining to Czech rural areas. For instance, by emphasizing the lack of socially inclusive policies for the rural population in Czechia, Kucerova (2018) shows how the Czech rural development programs and NGOs have neglected the issue of social inclusion, and underlines the widening income gap between urban and rural households. Furthermore, a clear discrepancy between the business strategies of the rural entrepreneurs in Czechia and the local strategies formulated by the Local Action Groups founded by the European Common Agricultural Policy (CAP) is detected by Boukalova et al. (2016) following a quantitative analysis of 498 media articles relevant to the topic. Additional observations on the Czech entrepreneurial climate can be seen in the results of Avramenko and Silver (2010) and Dvouletý (2019), who focus on Czech SMEs in rural and general contexts respectively. Avramenko and Silver (2010) find that Czech rural entrepreneurs who have foreign partners in other EU countries are dependent on their partners regarding issues such as management, sales, and production. Through a more recent and policy-oriented approach, Dvouletý (2019) suggests that tertiary education status should be included in the criteria for assessment used in business support programs in Czechia, based on empirical findings for the period 2005-2017. Business support and subsidy programs in Czechia have also been examined by Lukeš (2017), who draws attention to the fact that microcredit schemes are not included in government support programs for rural areas, and suggests the need for automated practices to circumvent complicated application requirements.
In the present study, the potential entrepreneurs in Czechia are well represented: out of the 642 potential entrepreneurs in the dataset, 163 are located in Czechia, and the particularities pertaining to rural Czechia are consequently prominent in our results as well. (Leaving the 163 individuals from Czechia out yields the classification tree shown in Figure A4; its implications are discussed in Appendix A.3.)
Looking back at the group who were deterred and did not attempt to borrow, we observe that features representing trust in political actors are selected as predictors (TrustPresid, TrustParl). More explicitly, those who place their trust guardedly in the presidency or the parliament are predicted to succeed, as opposed to those with higher levels of trust in these institutions. The seemingly counter-intuitive role of trust is further discussed in the concluding discussion in Section 7, following the presentation of all subsequent empirical models.
Finally, while a larger group of the entrepreneurs who were not able to borrow funds are strongly predicted to fail (i.e., with low impurity), those who have attended demonstrations are expected to succeed, albeit with a higher impurity level in the corresponding terminal node. This variable is largely absent from our following ensemble results, and it may have been selected by the tree as it may convey how likely an individual is to “stand up for her/his rights.” However, it should be emphasized that the partitions made by this single classification tree, and the paths it takes to reach certain predictions, should not be over-analysed, and too much meaning should not be attributed to its feature selections. As discussed below in Section 5, different variable choices and tree structures would likely result if the random train-test data split is redone, or a new group of individuals is used. The ensemble models implemented in the following sections aim to address these drawbacks of a single tree approach.
Setting aside the shortcomings of this “one-shot” prediction attempt for now, the selection of capital and borrowing-related predictors as top splitters by the single tree does not come as a surprise, and is also observed in the subsequent ML models presented in this study. Capital constraints are known to be major determinants of entrepreneurial activity. For instance, using British micro data, Blanchflower and Oswald (1998) showed that raising capital is the main problem facing entrepreneurs, and that those who have received gifts or an inheritance are more likely to become self-employed.
5 BOOTSTRAP AGGREGATION AND RANDOM FOREST
The single classification tree in Figure 1 simply serves as a basis for the ensemble models subsequently employed in this study. Such a single tree has certain drawbacks: it is hardly robust to changes in the data set, and prone to omitting strong predictors that are highly correlated with other features (Athey & Imbens, 2019; James et al., 2013). Nevertheless, the tree is useful for getting some preliminary ideas on the mechanisms behind the success (or failure) of potential rural entrepreneurs in starting a business. The bootstrap aggregated and random forest algorithms developed by Breiman (1996, 2001), on the other hand, yield many versions of the prediction structure (i.e., the single tree) and make the final prediction based on a majority vote, leading to considerable improvements in accuracy and a reduction in the variance of the prediction function (Breiman, 1996; 2001; Friedman, 2001). The random forest approach is particularly effective for the above-mentioned shortcoming regarding the omission of relevant predictors, as it de-correlates the trees in the ensemble through randomized restrictions on the feature space (James et al., 2013).
The bootstrap aggregated model constructs L trees, each corresponding to a sample l = 1, …, L of size N drawn from the training data (with replacement). The bootstrap aggregated prediction (also called the “bagging” prediction) for an individual i is given by the majority vote of the L trees. An estimate of the test classification error can be obtained by predicting each observation using only the trees built on samples in which that observation was out-of-bag (OOB), and this estimate is used as a criterion to identify the optimum size of L for the bagging model (James et al., 2013). After looping over 10 to 500 trees, we find an optimum number of trees of 16. The resulting bagged prediction has an accuracy of 88%.
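The bagging and OOB logic described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the study: one-split "stumps" stand in for full classification trees, the toy data are invented, and the names `fit_stump`, `predict_stump`, and `bagging_oob_error` are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Fit a one-split classifier (a stand-in for a full tree)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or errors < best[0]:
                best = (errors, j, thr, l_lab, r_lab)
    if best is None:  # no valid split in this sample: predict the majority class
        label = majority(y)
        return (0, float("inf"), label, label)
    return best[1:]

def predict_stump(stump, row):
    j, thr, l_lab, r_lab = stump
    return l_lab if row[j] <= thr else r_lab

def bagging_oob_error(X, y, n_trees, seed=0):
    """OOB misclassification rate of a bagged ensemble of n_trees stumps."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size N
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag observations receive a vote
                votes[i][predict_stump(stump, X[i])] += 1
    scored = [(v.most_common(1)[0][0], y[i]) for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no observation was ever out-of-bag")
    return sum(pred != truth for pred, truth in scored) / len(scored)

# Toy data (hypothetical): one numeric feature, two classes.
X = [[age] for age in range(20, 40)]
y = [0] * 10 + [1] * 10
oob_by_size = {L: bagging_oob_error(X, y, L) for L in (10, 25, 50)}
best_L = min(oob_by_size, key=oob_by_size.get)
```

Looping over candidate ensemble sizes and keeping the size with the lowest OOB error mirrors, in miniature, the search over 10 to 500 trees reported above.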
In econometric models, the effect sizes of model variables are often presented in the form of elasticities. This approach allows researchers to isolate ceteris paribus effects of variables that are selected based on theoretical specifications, and to assess the magnitudes and significance of explanatory variables relative to each other. Ensemble methods, on the other hand, do not follow specific functional forms, but instead permit all possible functional forms and non-linearities to be used (Mullainathan & Spiess, 2017). That is to say, instead of specifying a single equation, these models follow a “tuning” or adjustment process driven by a data-driven algorithm (Athey, 2018). However, it is important to note that concerns regarding causality are present, as these approaches are strictly oriented towards predictive accuracy rather than causal effects and their magnitudes, indicating the necessity of cautious research design (Grimmer, 2015). Causal effects are an emerging topic in ML models, particularly in relation to the usage of treatment effect techniques (Athey & Imbens, 2016). This issue is particularly important in the present analysis, and hence strong claims regarding causal effects are not made in this study. While acknowledging the shortcomings of algorithmic big data techniques, we base our analysis on the advantages of these approaches, particularly on their effectiveness in performing out-of-sample predictions with high accuracy when there is a large number of potential predictors (Einav & Levin, 2014).
Despite the lack of parameters to be estimated (unlike in many econometric models), ensemble models provide useful information for comparing effect sizes of predictors in the form of “variable importance” values. Any given variable may exhibit different effects in different trees in the ensemble. Moreover, in some trees belonging to the ensemble the variable may not be selected despite being relevant, a situation that applies in particular to the random forest model discussed below. Furthermore, the way the variable is used for prediction may be subject to different functional forms and interactions in each tree. Therefore, a single numeric ceteris paribus effect size as reported in regression tables is not generated by tree-based ML algorithms. Instead, the effect of the feature is quantified by calculating how useful it has been in achieving accurate predictions. For a given feature, this value, formally termed the variable importance, is computed by averaging, over all L trees, the total reduction in impurity due to splits over that feature. A reduction in impurity resulting from a split at node m is clearly an improvement towards prediction, and is calculated as the impurity of node m minus the weighted average impurity of the two child nodes created by the split:

$$\Delta\text{Impurity}(m) = \text{Impurity}(m) - \frac{N_{m_L}}{N_m}\,\text{Impurity}(m_L) - \frac{N_{m_R}}{N_m}\,\text{Impurity}(m_R),$$

where $m_L$ and $m_R$ denote the left and right child nodes of m, and $N_m$, $N_{m_L}$, and $N_{m_R}$ denote the numbers of observations in the respective nodes. Scaling the resulting ΔImpurity value into a 1-100 interval allows us to assess the importance of the predictors with the highest contributions (James et al., 2013). The twenty most important features selected by the bagging model are shown in Figure 2a, and the definitions and descriptive statistics of the variables selected by the bagged model and by the random forest model discussed below (which are not already defined in Table 1) are presented in Tables 3 and 4. The results are discussed below in conjunction with the random forest model, following the application of the latter.
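As a hedged illustration of the variable importance calculation, the sketch below computes the Gini impurity reduction of a single split, averages per-tree totals, and rescales the result. The function names are hypothetical, and expressing each importance relative to the largest one (so that the top feature scores 100) is one common scaling convention assumed here; the study does not spell out its exact scaling rule.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_reduction(parent, left, right):
    """Impurity of the parent node minus the weighted impurity of its children."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children

def average_importance(per_tree_totals):
    """Average, over trees, of each feature's total impurity reduction.

    per_tree_totals: list of dicts {feature: total reduction in that tree};
    a feature that a tree never split on contributes zero for that tree.
    """
    features = {f for tree in per_tree_totals for f in tree}
    n_trees = len(per_tree_totals)
    return {f: sum(tree.get(f, 0.0) for tree in per_tree_totals) / n_trees
            for f in features}

def scale_importance(raw):
    """Rescale so that the most important feature scores 100 (assumed convention)."""
    top = max(raw.values())
    return {name: 100.0 * value / top for name, value in raw.items()}

# A perfectly separating split removes all impurity from a balanced node.
reduction = impurity_reduction([1, 1, 0, 0], [1, 1], [0, 0])

# Hypothetical per-tree totals for two features in a two-tree ensemble.
raw = average_importance([{"Age": 0.5, "Educ": 0.25}, {"Age": 0.5}])
scaled = scale_importance(raw)
```

Because a feature that is absent from a tree contributes zero for that tree, a predictor that is relevant but often crowded out by correlated features receives a lower average, which is the situation the random forest model is designed to mitigate.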

| Name | Description | Values |
|---|---|---|
| Age | Age of the respondent. | Numeric value. |
| Compete | Score measuring the respondent's opinion on competition. | A 1 to 10 scale where 1 indicates a positive view on competition as it “creates stimulation for idea development and hard work” and 10 indicates a negative view and that competition brings out the worst in people. |
| CorruptDiff | Score measuring the strength of the opinion that ordinary people can play a role in the struggle against corruption. | A 1 to 5 scale where 1 indicates strong disagreement and 5 indicates strong agreement. |
| EconSyst | Categorical variable indicating the participant's views on the economic system. | 1: Prefers a market economy (295), 2: Prefers a planned economy under certain conditions (165), 3: It does not matter for people like her/him (182). |
| Educ | The respondent's level of education. | Values representing the stages of education where 1 indicates “no degree” and 8 indicates “PhD or Master's degree.” |
| EducFather | The level of education of the respondent's father. | Values representing the stages of education where 1 indicates “no degree” and 8 indicates “PhD or Master's degree.” |
| EducMother | The level of education of the respondent's mother. | Values representing the stages of education where 1 indicates “no degree” and 8 indicates “PhD or Master's degree.” |
| Czech | Variable identifying the respondent's ethnicity as Czech. | Equals 1 if the respondent defines herself/himself as belonging to this ethnicity (116) and 0 otherwise (526). |
| ImpSucc | Variable categorizing the choice of the respondent regarding the most important factor to be successful in her/his country. | 1: Hard work and Effort (268), 2: Skills and Intelligence (166), 3: Political connections (148), 4: Breaking the law (38), 5: Other (22). |
| IntGroup | Opinion on how interest groups affect government decisions. | A 1 to 10 scale where 1 indicates that the respondent believes such groups do not cause any problems and 10 indicates that the respondent believes such groups should be restricted. |
| PrivPub | Score measuring the commitment to private business activity. | A 1 to 10 scale where 1: “Private ownership of industry and business should be increased” and 10: “Public ownership of industry and business should be increased”. |
| QuestAuth | Score measuring the tendency to question the actions of authorities. | A 1 to 10 scale where 1 indicates the highest tendency to question authorities and 10 indicates a high tendency to respect and not question authorities. |
| TrustCourts | The degree of trust the respondent has in the courts (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| TrustGov | The degree of trust the respondent has in the government (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| TrustNeigh | The degree of trust the respondent has in persons in her/his neighborhood (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| UConnections | Score measuring the strength of the opinion that having connections is important for getting into university. | A 1 to 5 scale where 1 indicates no importance and 5 indicates essentiality. |
| WhoLoan | Categorical variable indicating the source from which the respondent borrowed the necessary funds. | 1: A relative (26), 2: A friend (8), 3: A private money lender (9), 4: A bank (99), 5: An NGO or microfinance (3), 6: Other (4), 7: Did not borrow (493). |
| WhyNotLoan2 | The selection of the option “complex application procedures” as the reason for not borrowing funds. | Category 2 of the earlier defined variable. |
| WhyNotLoan3 | The selection of the option “unfavorable interest rates” as the reason for not borrowing funds. | Category 3 of the earlier defined variable. |
- Note: Some feature definitions in this table may be fully or partly similar to those in the source LiTS III code book (EBRD, 2016)
- The observations falling into each category are shown in bold and in parentheses within the corresponding variable definitions.
| Variable | Median / Mode | Min | Max |
|---|---|---|---|
| Age | 49 | 18 | 86 |
| Compete | 3 | 1 | 10 |
| CorruptDiff | 3 | 1 | 5 |
| EconSyst | 1 | 1 | 3 |
| Educ | 4 | 1 | 8 |
| EducFather | 3 | 1 | 8 |
| EducMother | 3 | 1 | 8 |
| Czech | 0 | 0 | 1 |
| ImpSucc | 1 | 1 | 5 |
| IntGroup | 9 | 1 | 10 |
| PrivPub | 5 | 1 | 10 |
| QuestAuth | 4 | 1 | 10 |
| TrustCourts | 3 | 1 | 5 |
| TrustGov | 2 | 1 | 5 |
| TrustNeigh | 4 | 1 | 5 |
| UConnections | 1 | 1 | 5 |
| WhoLoan | 7 | 1 | 7 |
| Number of observations = 642 for all variables | | | |
- Note: The central tendency measure is reported as the median for ordinal variables, and the mode for categorical variables.
The random forest technique (Breiman, 2001) introduces additional stochasticity in order to cope with the between-tree correlation expected in the bagging approach (Friedman, 2001; James et al., 2013). For every tree in the random forest, the set of split candidates for a given node is limited to a randomly drawn subset of J variables out of the K predictors in the complete feature space. For classification models, the default size of J is approximately $\sqrt{K}$ (Breiman, 2001; Friedman, 2001).
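The restriction of split candidates can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the default is taken as the integer part of the square root of K, and K = 225 is an invented feature count chosen only because it yields J = 15.

```python
import math
import random

def default_subset_size(k):
    """Default number of split candidates for classification: floor(sqrt(K))."""
    return max(1, math.isqrt(k))

def draw_split_candidates(feature_names, rng):
    """Randomly draw J features (without repetition) to consider at one node."""
    j = default_subset_size(len(feature_names))
    return rng.sample(feature_names, j)

# Hypothetical feature space with K = 225 predictors.
features = [f"x{k}" for k in range(225)]
rng = random.Random(42)
candidates = draw_split_candidates(features, rng)
```

A fresh draw is made at every node of every tree, so a strong predictor cannot dominate all splits and correlated but relevant features get a chance to be selected.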
We apply a random forest procedure with 500 trees and J ≈ 15, which attains an accuracy of 92% on the test data. The random forest variable importance values are visualized alongside those of the bootstrap aggregated model in Figure 2b. The age of an individual, in line with the observation by Meccheri and Pelloni (2006), is identified as the most important factor by the bagged prediction, and is further examined through individual conditional expectation plots in Section 6.
Both the random forest and the bagging models highlight the role of the success or failure in taking out a loan (SuccessLoan), and the reasons behind the outcome of this activity (WhyNotLoan). The bagging algorithm further distinguishes among the importance of the categories of these variables, and underlines failure in borrowing money (SuccessLoan2) and the cases where aspiring entrepreneurs were not able to borrow money due to unfavorable interest rates (WhyNotLoan3) as strong predictors. Furthermore, in both models, trust-related factors such as trust placed in the individuals in one's neighborhood (TrustNeigh), the government (TrustGov), banks (TrustBanks), and the presidency (TrustPresid) have been used as top splitting variables and were effective in reducing Gini impurity. This outcome is in accordance with the earlier findings in the literature which show that the realization of potential ideas strongly depends on trust, which itself is an outcome of social relations (Akgün et al., 2010; Granovetter, 1985). Often viewed as an integral part of social capital, high levels of trust allow individuals to take more risks, accept or imitate technological innovations, and this in turn, leads to enhanced business activity (Cooke & Wills, 1999; Stathopoulou et al., 2004; Whiteley, 2000). However, as hinted in the earlier presented classification tree, the roles of trust-related factors may be contingent on other conditions and may be subject to non-linear structures. The two ensemble models successfully capture their effects by allowing all possible functional forms when these variables enter the prediction process.
Moreover, as found in the models, the endorsement of the private industry, represented by PrivPub, can be considered in conjunction with an individual's view on competition (Compete), her/his perception of the most important factor for success (ImpSucc), and to a lesser degree, her/his risk-taking behavior (RiskTaker) as a signal for the potential rural entrepreneur's “fighting spirit” in the face of competitive environments.
Apart from financial factors, the levels of education of the individual and of her/his parents (EducMother, EducFather), which were not captured by the classification tree, are selected as important predictors (the random forest model selects only Educ, with a low importance level). Based on both the random forest and bagged model results, other factors related to the views of the respondents on institutions (IntGroup), the authorities (QuestAuth), corruption (CorruptDiff), and the importance of personal connections (UConnections) are features that contribute to prediction with varying levels of importance. Finally, the result regarding rural entrepreneurs based in Czechia, which was also observed in the classification tree, is implied by both ensemble models.
To conclude this section, the two- and three-dimensional proximity plots resulting from the random forest application are shown in Figures 3a and 3b respectively, where dark green circles represent the individuals who have not been successful in setting up a business, light green circles represent those who have been successful, and larger circle sizes represent older individuals. The proximities visualized in these plots are calculated using the OOB observations' pair-wise frequencies of sharing a terminal node (Breiman & Cutler, 2020; Friedman, 2001). Alternatively stated, at each iteration (tree) of the random forest, it is possible to predict the observations that were left out due to the random sampling made in that iteration (i.e. the out-of-bag observations defined above). It is then possible to record, at each instance, whether any given two observations are predicted to be in the same partitioned data region (tree leaf/terminal node). The proximity measure between those two observations is increased by 1 every time a terminal node is shared at the end of an iteration, and an N × N matrix of proximities is obtained after normalizing the proximity values by dividing them by the number of trees. Finally, the proximities among the potential entrepreneurs can be shown visually after expressing the values in, say, two or three dimensions using metric multidimensional scaling (Breiman & Cutler, 2020; Friedman, 2001). A cluster of unsuccessful potential entrepreneurs is more or less visually recognizable on the left-hand sides of the 2-D and 3-D proximity graphs in Figures 3a and 3b; that is, the random forest model produces a grouping in the predictions rather than an absence of pattern. Both plots therefore imply that the random forest application has been quite successful in distinguishing between the two classes.
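The proximity calculation described in this paragraph can be sketched as below. The inputs are assumed to be available from a fitted forest: `leaf_ids[t][i]` is the terminal node reached by observation i in tree t, and `oob_mask[t][i]` flags whether i was out-of-bag for that tree. Both names are hypothetical, setting the diagonal to 1 is a convention assumed here, and the multidimensional scaling step is omitted.

```python
def proximity_matrix(leaf_ids, oob_mask):
    """N x N proximities: share of trees in which two OOB observations share a leaf."""
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for t in range(n_trees):
        for i in range(n):
            if not oob_mask[t][i]:
                continue
            for j in range(i + 1, n):
                # Count the pair only when both are out-of-bag and land in the same leaf.
                if oob_mask[t][j] and leaf_ids[t][i] == leaf_ids[t][j]:
                    prox[i][j] += 1.0
                    prox[j][i] += 1.0
    for i in range(n):
        for j in range(n):
            prox[i][j] /= n_trees  # normalize by the number of trees
        prox[i][i] = 1.0  # assumed convention: an observation is fully proximate to itself
    return prox

# Toy example (hypothetical): two trees, three observations, all out-of-bag.
leaf_ids = [[1, 1, 2], [3, 4, 4]]
oob_mask = [[True, True, True], [True, True, True]]
prox = proximity_matrix(leaf_ids, oob_mask)
```

One minus the proximity can then serve as a dissimilarity and be passed to a metric multidimensional scaling routine to obtain two- or three-dimensional coordinates.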

6 STOCHASTIC AND NON-STOCHASTIC GRADIENT BOOSTING MACHINES
The last two machine learning procedures applied in this study are sequentially additive tree models, namely the gradient boosting machine (GBM; Friedman et al., 2001) and the stochastic gradient boosting machine (SGBM; Friedman, 2002). In addition to the ensemble models applied earlier, these yield further information on the factors associated with rural entrepreneurship, this time through sequential ensemble techniques where, in each succeeding tree, higher weight is given to potential entrepreneurs who were misclassified by the preceding tree. In other words, the trees in the ensemble are allowed to learn from their predecessors by being introduced sequentially into the collection of trees. This added feature of learning is absent in the bagging and random forest algorithms, and can be seen as a strong alternative approach for prediction. While the advantage of the former two ensemble approaches lies in attempting the prediction through various randomization processes, the benefit of sequential models lies in the serial correction of previous errors.
In each iteration, a regression tree is fitted to the residuals of the current prediction, and fitted values are generated for the |t| terminal nodes belonging to tree t. The tree index used earlier for the subtrees mentioned in Section 4 is reused here for simplicity; the series of sequential trees noted here is unrelated to those from Section 4, despite the reuse of the same index notation. The regression tree method, which is defined as part of the Classification and Regression Trees (CART) framework of Breiman et al. (1984), is not separately outlined in this paper as it is only an intermediate step in GBM and SGBM.

The essential parameter in GBM and SGBM is the learning rate, denoted δ, as each subsequent tree updates the predictions made by the preceding trees by adding the fitted residuals weighted by this rate. The recursive construct applies to every tree t until it stops when subsequent trees are no longer able to improve the predictions, and is expressed as

$$\hat{f}_t(x_i) = \hat{f}_{t-1}(x_i) + \delta\,\hat{r}_t(x_i) \qquad (2)$$

where $\hat{f}_t(x_i)$ is the predicted value for individual i after tree t, $\hat{r}_t(x_i)$ is the residual fitted for that individual by tree t, and 0 < δ < 1 (Friedman, 2001; Friedman, 2002; Friedman et al., 2001). The slowed-down form of incremental learning introduced by δ allows many trees with diverse builds to consecutively improve on previous predictions (James et al., 2013). In addition to the learning rate, the maximum number of splits allowed and the number of iterations (trees) have to be tuned when applying a GBM or SGBM. A large number of iterations is needed if the learning rate is small, whereas a maximum split number (i.e. depth of interaction) between 4 and 8 usually performs well (Friedman, 2001; Friedman et al., 2001).
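The role of the learning rate δ in the sequential update can be illustrated with a deliberately simplified sketch. It assumes a squared-error loss, a single numeric feature with at least two distinct values, and one-split regression stumps in place of deeper trees; the gradient boosting classifiers used in the study fit trees to the gradients of a classification loss instead. The function names are hypothetical.

```python
def fit_regression_stump(x, r):
    """Least-squares single split of residuals r on one numeric feature x."""
    best = None
    for thr in sorted(set(x))[:-1]:  # dropping the largest value keeps the right side non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees, delta):
    """Sequentially add stumps fitted to residuals, each shrunk by the learning rate delta."""
    pred = [sum(y) / len(y)] * len(y)  # start from a constant prediction
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        thr, left_mean, right_mean = fit_regression_stump(x, resid)
        pred = [pi + delta * (left_mean if xi <= thr else right_mean)
                for xi, pi in zip(x, pred)]
    return pred

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

# Toy data (hypothetical): the outcome switches from 0 to 1 as x increases.
x = [1, 2, 3, 4]
y = [0, 0, 1, 1]
few = boost(x, y, n_trees=5, delta=0.1)
many = boost(x, y, n_trees=50, delta=0.1)
```

With a small δ each tree corrects only a fraction of the remaining error, which is why a small learning rate must be paired with a large number of iterations.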
Introducing stochasticity to GBM may improve predictions and can facilitate otherwise computationally challenging procedures. In an SGBM framework, randomization is incorporated in two ways: a subset of the training data set can be drawn randomly (without replacement) when predicting the residuals at each iteration, and/or a subset of the available variables can be drawn randomly at each node and tree, similar to the random forest algorithm (Friedman, 2002).
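The two sources of randomization can be sketched as a single sampling step. The function name and the toy sizes are hypothetical; the fractions 0.5 and 0.8 echo the sampling proportion and variable fraction reported for SGBM in this section. For brevity the sketch draws both subsets once per iteration, whereas the variable subset described above may be redrawn at each node.

```python
import random

def draw_sgbm_subsets(n_obs, feature_names, row_fraction, col_fraction, rng):
    """Draw the observations and features made available to one boosting iteration."""
    n_rows = max(1, int(row_fraction * n_obs))
    n_cols = max(1, int(col_fraction * len(feature_names)))
    rows = rng.sample(range(n_obs), n_rows)   # sampled without replacement
    cols = rng.sample(feature_names, n_cols)  # sampled without replacement
    return rows, cols

# Hypothetical training set of 100 observations and 10 features.
rng = random.Random(7)
features = [f"x{k}" for k in range(10)]
rows, cols = draw_sgbm_subsets(100, features, 0.5, 0.8, rng)
```

Unlike the bootstrap samples used in bagging, these row subsets contain no duplicates, so each iteration sees a strictly smaller portion of the training data.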
The GBM and SGBM algorithms are applied with the following parameter values. A learning rate of 0.005, a maximum depth of interaction of 6, and a sampling proportion of 0.5 (for SGBM only) are used, in line with the demonstration by Friedman (2002) that these values help against overfitting due to large trees when the sample size is small. Friedman (2002) shows that these parameter values are useful for gradient boosting machines with regression trees in particular. For SGBM, the fraction of variables to be selected in each split, the minimum number of observations in the terminal nodes, and the maximum number of trees are set to 0.8, 5, and 61 respectively, and δ is set to 0.1 following a grid search for these parameters. The algorithms yield accuracies of 89% for GBM and 79% for SGBM on the test data.
The variable importance plots for GBM and SGBM are presented in Figures 2c and 2d, and Tables 5 and 6 present the definitions and descriptive statistics of the features that were not picked by the earlier applied ML algorithms but were selected by the two sequential ensemble models. Both models attribute the highest level of importance to the features related to borrowing the necessary funds to start a business: WhyNotLoan in the GBM and SuccessLoan in both the GBM and SGBM are by far the features with the highest contributions to prediction accuracy. Variables related to trust levels also rank high in importance (TrustNeigh, TrustParl, TrustGov, TrustBanks), with Age, Czechia, and the individual's commitment to private business activity (PrivPub) exhibiting high contributions in line with our earlier findings. Several variables not previously selected by the bootstrap aggregation and random forest algorithms are used by GBM and SGBM, but with relatively low levels of importance. These additional predictors relate to following the news (SocialMed, Newspaper), membership in sports clubs (SportMbr), and views on obeying the law and political activism (ObeyLaw, Petitions, Strike).
| Name | Description | Values |
|---|---|---|
| CnctDisp | Score measuring the strength of the opinion that having connections is important | A 1 to 5 scale where 1 indicates no importance and 5 indicates essentiality. |
| CorruptSoc | Score measuring the strength of the opinion of the participant on how acceptable it is for ordinary individuals to report a case of corruption in her/his society. | A 1 to 5 scale where 1 indicates strong disagreement and 5 indicates strong agreement. |
| Newspaper | A 1 to 7 scale measuring the frequency of following global and countrywide developments by reading newspapers. | 1 indicates “never” and 7 indicates “daily.” |
| ObeyLaw | Score measuring the commitment to obey the law. | A 1 to 10 scale where 1 indicates that the respondent strongly believes that individuals must obey the law and 10 indicates that the respondent believes that sometimes there are reasons to not obey the law. |
| Petitions2 | The selection of the option “Might do” in response to the question on the likelihood of the respondent to sign petitions. | Categorical variable where 1 indicates the respondent has already signed at least one (204), 2 indicates the respondent may sign a petition (276), and 3 indicates that the respondent would never sign one (162). |
| SocialMed | A 1 to 7 scale measuring the frequency of following global and countrywide developments through social media. | 1 indicates “never” and 7 indicates “daily.” |
| SportMbr | Variable asking whether the respondent is a member of a recreational/sport association or organization. | 1: Active member. 2: Inactive member. 3: Not a member. |
| Strike | Score measuring how likely the respondent is to participate in a strike. | Categorical variable where 1 indicates the respondent has already participated in at least one (102), 2 indicates the respondent may participate (270), and 3 indicates that the respondent would never participate (270). |
| TrustMeet | The degree of trust the respondent has in people she/he meets for the first time (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| TrustPolice | The degree of trust the respondent has in the police (scale 1 to 5). | 1 indicates complete distrust and 5 indicates complete trust. |
| CProblems.1 | The selection of the option “Health” as the most important problem in the country. | Binary variable. |
| News | A 1 to 7 scale measuring the frequency of following global and countrywide developments through in-depth reports on TV or radio. | 1 indicates “never” and 7 indicates “daily.” |
| Magazine | A 1 to 7 scale measuring the frequency of following global and countrywide developments by reading printed magazines. | 1 indicates “never” and 7 indicates “daily.” |
| Corrb | Score measuring how likely the respondent is to report corruption if witnessed. | A 1 to 5 scale where 1 indicates strong disagreement and 5 indicates strong agreement. |
| Talk | A 1 to 7 scale measuring the frequency of gaining information about the state of the country through talks with family members. | 1 indicates “never” and 7 indicates “daily.” |
- Note: Some feature definitions in this table may be fully or partly similar to those in the source LiTS III code book (EBRD, 2016)
- The observations falling into each category are shown in bold and in parentheses within the corresponding variable definitions.
| Variable | Median / Mode | Min | Max |
|---|---|---|---|
| CnctDisp | 2 | 1 | 5 |
| CorruptSoc | 3 | 1 | 5 |
| Newspaper | 5 | 1 | 7 |
| ObeyLaw | 3 | 1 | 10 |
| Petitions | 2 | 1 | 3 |
| SocialMed | 5 | 1 | 7 |
| SportMbr | 3 | 1 | 3 |
| Strike | 2,3 | 1 | 3 |
| TrustMeet | 3 | 1 | 5 |
| TrustPolice | 4 | 1 | 5 |
| Number of observations = 642 for all variables | | | |
- Note: The central tendency measure is reported as the median for ordinal variables, and the mode for categorical variables.
The vast majority of the features in the LiTS III data set are categorical variables. However, the features measuring the potential entrepreneur's age and level of education, generally selected by the ensemble and sequential algorithms employed in this study, are quantitative and ordinal measures. The centered individual conditional expectation (ICE) plot for the predictor Age, and the partial dependence plot (PDP) for both age and the level of education, are shown in Figures 4a and 4b respectively. The ICE plot is introduced by Goldstein et al. (2015), and the PDP is put forward by Friedman (2001). Figure 4a suggests that the probability of being successful in setting up a business is higher for middle-aged potential rural entrepreneurs than for younger individuals and for persons older than about 55. In other words, the marginal effect of age on the entrepreneurial success outcome is lower for younger individuals and the elderly, highlighting the non-linear relationship between age and success. Each line represents a potential entrepreneur and how the probability of success for that individual (represented by the y-axis in Figure 4a) changes when different values for Age are used for that person, ceteris paribus. While the wide spread of the individual lines over the y-axis implies considerable heterogeneity in the role of age, the general pattern seems to be consistent, as indicated by the red line, which is the average of all individual lines (i.e., the red line is the PDP). All individual plots are centered on the probability 0.5 for a clearer visualization of the effect (note: these plots do not imply causality). In the two-way PDP shown in Figure 4b, lighter colors represent higher probabilities of success relative to the probabilities represented by darker colors, and the y-axis and the x-axis represent the age and the level of education (Educ) respectively.
The horizontal array of colors corresponding to ages around 50 is in general lighter than the colors seen above (older individuals) and below (younger individuals), echoing the implication of Figure 4a. The horizontal axis, on the other hand, represents the level of education, and suggests that younger individuals with a very low or medium level of education are predicted to be less successful in starting a business in rural areas, pointing to education as another non-linear predictor.
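The construction of the ICE curves and of the PDP as their average can be sketched as follows. The `toy_predict` function is an invented stand-in for a fitted model's predicted probability of success, the rows are hypothetical respondents, and anchoring each curve at 0.5 at the first grid value is one possible reading of the centering described above.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per individual: predictions as `feature` varies, all else held fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)       # copy so the original row is left untouched
            modified[feature] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def center(curves, anchor=0.5):
    """Shift every curve so that it starts at `anchor` (assumed centering rule)."""
    return [[p - curve[0] + anchor for p in curve] for curve in curves]

def partial_dependence(curves):
    """The PDP is the point-wise average of the individual curves."""
    n = len(curves)
    return [sum(curve[k] for curve in curves) / n for k in range(len(curves[0]))]

def toy_predict(row):
    """Hypothetical model: success probability peaks in middle age, rises with education."""
    return 0.8 - ((row["Age"] - 45) / 50) ** 2 + 0.02 * row["Educ"]

rows = [{"Age": 30, "Educ": 2}, {"Age": 60, "Educ": 6}]
grid = [20, 45, 80]
curves = ice_curves(toy_predict, rows, "Age", grid)
pdp = partial_dependence(curves)
centered = center(curves)
```

The spread between individual curves reflects heterogeneity across respondents, while the averaged curve summarizes the overall pattern; neither supports a causal interpretation.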

As the fourth and fifth ML algorithms employed in this study, the GBM and the SGBM are valuable alternatives to the earlier implemented bagging and random forest models. The two sets of models address the same prediction task in different ways and can be viewed as complements to each other. The models in Section 5 focus on randomization, either of the individual observations or of the features, and generate each prediction independently. Therefore, each iteration in those models is a fresh attempt at the same prediction task. As the random forest approach provides an opportunity for relevant features that may be overshadowed due to high correlations with other predictors to enter the prediction process, it may be viewed as superior to the bagging approach thanks to the earlier discussed randomization in selecting the splitting features. Indeed, the article in which Breiman introduced the random forest model (Breiman, 2001) is clearly an improvement over his previous work on bootstrap aggregation (Breiman, 1996). As for the two variants of the gradient boosted models used in this study, a clear superiority over the random forest approach is not observed in the ML literature. While the latter model aggregates many complex trees, the GBM aims to sequentially improve a series of shallow trees (i.e. highly pruned weak learners). The performance of the GBM compared to the random forest model can only be observed after implementing the two algorithms. In our case, the GBM and the random forest models yield very similar accuracies (89% and 92%), and the SGBM performs with lower accuracy. The models do not diverge much regarding their selection of the top splitting features, reinforcing the conclusions drawn from each other. An alternative look at the consistency of the findings across the different ensemble models is presented in Table 7.
The X's denote that the corresponding variables listed in the first column are selected as one of the top twenty predictors by the ensemble models categorized in the remaining four columns. While Table 7 disregards the importance levels presented in Figure 2, it gives an overall picture of the accord among the four algorithms with regard to each variable. The ranking in the table is based firstly on the number of models in which a feature was selected, and secondly on alphabetical order. Success in taking a loan and trust-related variables (SuccessLoan, TrustBanks, TrustGov, TrustNeigh) rank high, alongside the age of the entrepreneur, her/his view of private endeavor (PrivPub), and whether she/he is located in rural Czechia. Therefore, the grouping in Table 7 confirms, albeit in a different manner, the main results highlighted in this study.
| Variable | Bagging | Random Forest | GBM | SGBM |
|---|---|---|---|---|
| Age | X | X | X | X |
| Czechia | X | X | X | X |
| PrivPub | X | X | X | X |
| SuccessLoan | X | X | X | X |
| TrustBanks | X | X | X | X |
| TrustGov | X | X | X | X |
| TrustNeigh | X | X | X | X |
| IntGroup | X | X | X | |
| QuestAuth | X | X | X | |
| RiskTaker | X | X | X | |
| TrustParl | X | X | X | |
| UConnections | X | X | X | |
| WhyNotLoan | X | X | X | |
| Compete | X | X | ||
| Czech | X | X | ||
| EconSyst | X | X | ||
| Educ | X | X | ||
| EducFather | X | X | ||
| Newspaper | X | X | ||
| SportMbr | X | X | ||
| TrustPresid | X | X | ||
| CnctDisp | X | |||
| CorruptDiff | X | |||
| CorruptSoc | X | |||
| Demonstration | X | |||
| EducMother | X | |||
| ImpSucc | X | |||
| ObeyLaw | X | |||
| OnlineNews | X | |||
| Petitions | X | |||
| SocialMed | X | |||
| Strike | X | |||
| TrustCourts | X | |||
| TrustMeet | X | |||
| TrustPolice | X | |||
| WhoLoan | X |
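The ordering rule used for Table 7 (number of selecting models first, alphabetical order second) can be expressed compactly. The selections below are a small invented example for illustration only and do not reproduce the models' actual top-twenty lists; `rank_features` is a hypothetical helper name.

```python
def rank_features(selected):
    """Rank features by how many models selected them, then alphabetically.

    selected: dict mapping a model name to the set of features in its top list.
    Returns a list of (feature, number_of_models) pairs.
    """
    counts = {}
    for features in selected.values():
        for feature in features:
            counts[feature] = counts.get(feature, 0) + 1
    # Negative count sorts the most widely selected features first.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Illustrative selections only (not the study's actual lists).
toy_selection = {
    "Bagging": {"Age", "PrivPub", "Compete"},
    "Random Forest": {"Age", "PrivPub", "Compete"},
    "GBM": {"Age", "PrivPub", "Newspaper"},
    "SGBM": {"Age", "Newspaper"},
}
ranking = rank_features(toy_selection)
```

Such a tally deliberately ignores the magnitude of each importance value and records only agreement across models.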
7 CONCLUDING REMARKS AND POLICY IMPLICATIONS
Entrepreneurship is an individual activity that is most likely driven by personal reasons (Wiklund et al., 2019). It is a critical activity that lays the foundation of economic liveliness, and contributes to the well-being of society in many ways, such as by alleviating poverty (Sutter et al., 2019). The present study introduced a new way of understanding the mechanisms of rural entrepreneurship through the use of five alternative machine learning algorithms. Using the LiTS III data, we observed that, alongside the expected effects of capital constraints, trust in individuals in one's immediate surroundings and in institutions is an important determinant of success in setting up a business. The age of the rural entrepreneur is observed to have a clear and non-linear relationship with entrepreneurial success. The frequency and the mode of following current news, education, and, to a lesser extent, personal traits related to risk taking and views on the workings of the free market and institutions have contributed to achieving high levels of predictive accuracy in our models.
Considering the findings regarding capital constraints as critical hindrances facing entrepreneurs in rural Europe, it should be noted that the importance of funding for rural areas has been a major area of interest for the European Union (European Commission Agriculture and Rural Development, 2011). Nevertheless, the recent data examined in the present study suggest that problems concerning access to capital are persistent; the predictive power of the models presented in this study relied heavily on the answers of the participants on whether or not they were able to take loans to start their businesses. The results on the trust-related factors, on the other hand, suggest that entrepreneurs who have succeeded in starting a business have a somewhat lower level of trust in political institutions and banks. This kind of observation regarding trust is not counter-intuitive and has been discussed in the entrepreneurship literature; too much trust in the government may signal a lack of independence and an expectation that the government should solve every problem (Bhatt et al., 2017; Li, 2013). More specifically, a potential entrepreneur who exhibits over-trusting behavior may assume, often in error, that the other party (government, banks, neighbors, etc.) will not be detrimental to the process, and refrain from going beyond a circle of known and trusted business partners (Welter, 2012; Welter & Smallbone, 2010).
The possible issue of inadequate independence, which in turn can lead to entrepreneurial diffidence, may be the result of cultural factors particular to rural communities. In fact, EU policymakers have frequently expressed this notion. For instance, along with most of the disadvantages of rural settings discussed in this study, the policymakers of the European Network for Rural Development (ENRD) underline how local and family cultures and patterns generally discourage entrepreneurial activity in rural areas (European Commission Agriculture and Rural Development, 2011). Designing policies that create financial opportunities for potential entrepreneurs in tandem with promoting an independent and entrepreneurial character, instead of treating the two issues in isolation, could lead to positive entrepreneurial outcomes.
Finally, a location-specific and policy-relevant finding emerged from the ML models, suggesting that in Czechia the prospects for successful rural entrepreneurship are particularly poor. While the case of Czechia has been discussed in Section 4 in relation to previous research on the entrepreneurial environment in this country, in-depth empirical case-study approaches are arguably needed to provide more concrete explanations of the underlying mechanisms particular to rural Czechia.
APPENDIX: DATA PARTITIONING
To further elaborate on the classification trees and the partition algorithm discussed in Section 4, we present in Figure A3 a partition based on only two features: trust in one's neighborhood (TrustNeigh) and Age. The former is selected on the basis that it appears as a top ordinal splitter in many of the models applied in this study. The latter feature, Age, is selected as the topmost splitter by the bootstrap aggregation algorithm and provides a wide range of values, unlike the remaining ordinal variables, which usually range only from one to five, seven, or ten. Because the complementary data partition graph presented in Figure A2 is not suitable for representing the partitions made by non-ordinal categorical predictors, features representing the success of borrowing or the reason for not borrowing are not used here.

As demonstrated in the two diagrams below, a partition process divides the feature space into separate regions based on the predictions of the classification tree. The data partition plot in Figure A2 is another way of representing the sample two-variable classification tree presented in Figure A3. The blue dots represent the individuals who have been successful in starting a business, while the dark yellow dots represent the “No” category. The areas shaded green contain the observations predicted as successful by the two-variable tree, and the regions colored in cyan indicate the leaves corresponding to predicted outcomes of “No.” Of course, these two-variable representations, while providing a better understanding of the data partition process, do not provide any useful empirical information, as almost all the potential predictors are ignored.
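The partition process described above can be illustrated with a minimal sketch. It assumes a greedy, Gini-based recursive split on two numeric features; the eight data points, the feature bounds, and the depth limit are hypothetical and are not taken from LiTS III, and the code is not the implementation used in this study.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini most, else None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def partition(rows, labels, bounds, depth=0, max_depth=2):
    """Recursively split the feature space.

    Returns a list of (bounds, predicted_label) rectangles, where bounds
    holds one (low, high) pair per feature.
    """
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return [(bounds, Counter(labels).most_common(1)[0][0])]
    f, t = split
    regions = []
    for go_left in (True, False):
        keep = [(r, y) for r, y in zip(rows, labels) if (r[f] <= t) == go_left]
        sub_bounds = list(bounds)
        lo, hi = bounds[f]
        sub_bounds[f] = (lo, t) if go_left else (t, hi)
        regions += partition([r for r, _ in keep], [y for _, y in keep],
                             sub_bounds, depth + 1, max_depth)
    return regions

# Hypothetical observations: (Age, TrustNeigh) with a Yes/No business outcome.
rows = [(25, 2), (30, 4), (35, 5), (42, 1), (48, 4), (55, 2), (60, 5), (67, 2)]
labels = ["No", "Yes", "Yes", "No", "Yes", "No", "No", "No"]

# Assumed plotting window: Age in [18, 80], TrustNeigh in [1, 5].
regions = partition(rows, labels, bounds=[(18, 80), (1, 5)])
for (age_range, trust_range), label in regions:
    print(f"Age {age_range}, TrustNeigh {trust_range} -> {label}")
```

Each returned rectangle corresponds to one leaf of the two-variable tree, which is the correspondence between the tree diagram and the partition plot.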


1 Unpruned Tree
For comparison with the single regression tree generated in Figure 4, the unpruned version of the tree is presented below. While overfitting is a clear concern in this case, the tree provides some useful hints on how the categories of the data at hand could be predicted if less emphasis is placed on the out-of-sample performance of the algorithm.
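The overfitting concern can be made concrete with a small sketch that contrasts a tree grown until every leaf is pure with a depth-limited one. Limiting depth is used here as a simple stand-in for pruning, which is an assumption rather than the procedure used in this study; the training and hold-out points are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, labels, max_depth=None, depth=0):
    """Grow a tree; max_depth=None keeps splitting until leaves are pure."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or (max_depth is not None and depth >= max_depth):
        return majority
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best_score:
                best, best_score = (f, t), score
    if best is None:  # all remaining rows are identical
        return majority
    f, t = best
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in left], [y for _, y in left], max_depth, depth + 1),
            grow([r for r, _ in right], [y for _, y in right], max_depth, depth + 1))

def predict(tree, row):
    """Walk from the root to a leaf; leaves are plain label strings."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def n_leaves(tree):
    return 1 if not isinstance(tree, tuple) else n_leaves(tree[2]) + n_leaves(tree[3])

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

# Hypothetical (Age, TrustNeigh) training data with two noisy observations.
train_rows = [(25, 2), (30, 4), (35, 5), (42, 1), (48, 4),
              (55, 2), (60, 5), (67, 2), (28, 1), (45, 5)]
train_labels = ["No", "Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No"]
# Hypothetical hold-out observations.
test_rows = [(27, 3), (33, 4), (50, 2), (62, 4)]
test_labels = ["No", "Yes", "No", "No"]

unpruned = grow(train_rows, train_labels)              # grown to purity
shallow = grow(train_rows, train_labels, max_depth=1)  # a single split

for name, tree in (("unpruned", unpruned), ("depth-1", shallow)):
    print(name, "leaves:", n_leaves(tree),
          "train acc:", accuracy(tree, train_rows, train_labels),
          "hold-out acc:", accuracy(tree, test_rows, test_labels))
```

The unpruned tree reproduces the training labels exactly because it keeps splitting around the noisy points, which is why its in-sample fit says little about out-of-sample performance.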
2 A Tree Without Czechia
In order to check the robustness of the results, a classification tree leaving out the 163 individuals from Czechia is generated and presented in Figure A4. The top predictors regarding capital constraints in the tree that uses the full data (Figure 4) still determine the top splits in this non-Czechia version. The tree, which omits the attributes specific to the Czech rural setting, unsurprisingly now extends to include a larger set of the remaining features selected by the ensemble models presented in Section 5, except for the predictors defined in Table A1 below. Following the news less through broadcasts predicts a positive outcome. This could indicate that being confined to older methods of communication instead of modern ones is associated with unsuccessful outcomes. Frequent social contact is not associated with success if it is in the form of meeting friends. Finally, being in the German region of Niedersachsen is associated with unsuccessful attempts to set up a rural business. Therefore, another case-specific result of policy interest may pertain to Niedersachsen. This is supported by the 2013 report “The Regional Entrepreneurship and Development Index - Measuring regional entrepreneurship” by the European Commission Directorate General for Regional and Urban Policy (Szerb et al., 2011), which listed Niedersachsen as a region with a significantly lower level of human capital relative to the other regions in the EU, and among regions with a high need to improve “…the populations' self-esteem about its ability to start successfully a business,” and its education quality.

| Name | Description | Values |
|---|---|---|
| FriendsMt | A 1 to 5 scale measuring the frequency of meeting friends and family (excluding household members). | 1 indicates “on most days” and 5 indicates “never.” |
| Niedersachsen | Binary variable indicating the region of the respondent as Niedersachsen, Germany. | Equals one if the respondent is based in the region and zero otherwise. |
| Broadcast | A 1 to 7 scale measuring how frequently news broadcasts are used to follow global and countrywide developments. | 1 indicates “never” and 7 indicates “daily.” |
- Note: Some feature definitions in this table may be fully or partly similar to those in the source LiTS III code book (EBRD, 2016).
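The robustness check described in this subsection, refitting after excluding one country and comparing the top split, can be sketched as follows. This is a minimal illustration under stated assumptions: only the first (root) split is compared, the split criterion is weighted Gini impurity, and the ten records and the feature names `Loan`, `Czechia`, and `Age` are hypothetical rather than drawn from LiTS III.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def top_split(records, features, target="Success"):
    """Return (feature, threshold, weighted_gini) of the best single split."""
    n = len(records)
    best = None
    for f in features:
        values = sorted({r[f] for r in records})
        # A feature that is constant in the sample offers no candidate splits.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in records if r[f] <= t]
            right = [r[target] for r in records if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

def rec(country, loan, age, success):
    """Build one hypothetical respondent record."""
    return {"Country": country, "Loan": loan, "Age": age,
            "Czechia": int(country == "Czechia"), "Success": success}

# Hypothetical respondents; "A" and "B" are placeholder countries.
full = [
    rec("A", 1, 30, "Yes"), rec("A", 1, 45, "Yes"), rec("A", 0, 50, "No"),
    rec("B", 1, 38, "Yes"), rec("B", 0, 60, "No"), rec("B", 0, 65, "Yes"),
    rec("Czechia", 0, 33, "No"), rec("Czechia", 0, 55, "No"),
    rec("Czechia", 1, 35, "Yes"), rec("A", 1, 58, "Yes"),
]
features = ["Loan", "Czechia", "Age"]

# Refit on the sample that leaves out one country.
without_cz = [r for r in full if r["Country"] != "Czechia"]

split_full = top_split(full, features)
split_sub = top_split(without_cz, features)
print("full sample:    ", split_full)
print("without Czechia:", split_sub)
```

In this toy sample the borrowing indicator remains the top split after the exclusion, and the country dummy drops out because it no longer varies, which mirrors the logic of the check rather than its empirical result.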


