Learning From Revealed Preference [Extended Abstract] Eyal Beigman CMS-EMS Kellogg School of Management Northwestern University Evanston IL 60208 e-beigman@northwestern.edu Rakesh Vohra MEDS Kellogg School of Management Northwestern University Evanston IL 60208 r-vohra@northwestern.edu ABSTRACT A sequence of prices and demands are rationalizable if there exists a concave, continuous and monotone utility function such that the demands are the maximizers of the utility function over the budget set corresponding to the price. Afriat [1] presented necessary and sufficient conditions for a finite sequence to be rationalizable. Varian [20] and later Blundell et al. [3, 4] continued this line of work studying nonparametric methods to forecasts demand. Their results essentially characterize learnability of degenerate classes of demand functions and therefore fall short of giving a general degree of confidence in the forecast. The present paper complements this line of research by introducing a statistical model and a measure of complexity through which we are able to study the learnability of classes of demand functions and derive a degree of confidence in the forecasts. Our results show that the class of all demand functions has unbounded complexity and therefore is not learnable, but that there exist interesting and potentially useful classes that are learnable from finite samples. We also present a learning algorithm that is an adaptation of a new proof of Afriat"s theorem due to Teo and Vohra [17]. Categories and Subject Descriptors F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; J.4 [Computer Applications]: Social and Behavioral Sciences-Economics; I.2.6 [Learning]: Parameter Learning 1. INTRODUCTION A market is an institution by which economic agents meet and make transactions. Classical economic theory explains the incentives of the agents to engage in this behavior through the agents" preference over the set of available bundles indicating that agents attempt to replace their current bundle with bundles that are both more preferred and attainable if such bundles exist. The preference relation is therefore the key factor in understanding consumer behavior. One of the common assumptions in this theory is that the preference relation is represented by a utility function and that agents strive to maximize their utility given a budget constraint. This pattern of behavior is the essence of supply and demand, general equilibria and other aspects of consumer theory. Furthermore, as we elaborate in section 2, basic observations on market demand behavior suggest that utility functions are monotone and concave. This brings us to the question, first raised by Samuelson [18], to what degree is this theory refutable? Given observations of price and demand, under what circumstances can we conclude that the data is consistent with the behavior of a utility maximizing agent equipped with a monotone concave utility function and subject to a budget constraint? Samuelson gave a necessary but insufficient condition on the underlying preference known as the weak axiom of revealed preference. Uzawa [16] and Mas-Colell [10, 11] introduced a notion of income-Lipschitz and showed that demand functions with this property are rationalizable. These properties do not require any parametric assumptions and are technically refutable, but they do assume knowledge of the entire demand function and rely heavily on the differential properties of demand functions. Hence, an infinite amount of information is needed to refute the theory. It is often the case that apart form the demand observations there is additional information on the system and it is sensible to make parametric assumptions, namely, to stipulate some functional form of utility. Consistency with utility maximization would then depend on fixing the parameters of the utility function to be consistent with the observations and with a set of equations called the Slutski equations. If such parameters exist, we conclude that the stipulated utility form is consistent with the observations. This approach is useful when there is reason to make these stipulations, it gives an explicit utility function which can be used to make precise forecasts on demand for unob36 served prices. The downside of this approach is that real life data is often inconsistent with convenient functional forms. Moreover, if the observations are inconsistent it is unclear whether this is a refutation of the stipulated functional form or of utility maximization. Addressing these issues Houthakker [7] noted that an observer can see only finite quantities of data. He askes when can it be determined that a finite set of observations is consistent with utility maximization without making parametric assumptions? He showes that rationalizability of a finite set of observations is equivalent to the strong axiom of revealed preference. Richter [15] showes that strong axiom of revealed preference is equivalent to rationalizability by a strictly concave monotone utility function. Afriat [1] gives another set of rationalizability conditions the observations must satisfy. Varian [20] introduces the generalized axiom of revealed preference (GARP), an equivalent form of Afriat"s consistency condition that is easier to verify computationally. It is interesting to note that these necessary and sufficient conditions for rationalizability are essentially versions of the well known Farkas lemma [6] (see also [22]). Afriat [1] proved his theorem by an explicit construction of a utility function witnessing consistency. Varian [20] took this one step further progressing from consistency to forecasting. Varian"s forecasting algorithm basically rules out bundles that are revealed inferior to observed bundles and finds a bundle from the remaining set that together with the observations is consistent with GARP. Furthermore, he introduces Samuelson"s money metric as a canonical utility function and gives upper and lower envelope utility functions for the money metric. Knoblauch [9] shows these envelopes can be computed efficiently. Varian [21] provides an up to date survey on this line of research. A different approach is presented by Blundell et al. [3, 4]. These papers introduce a model where an agent observes prices and Engel curves for these prices. This gives an improvement on Varian"s original bounds, though the basic idea is still to rule out demands that are revealed inferior. This model is in a sense a hybrid between Mas-Colell and Afriat"s aproaches. The former requires full information for all prices, the latter for a finite number of prices. On the other hand the approach taken by Blundell et al. requires full information only on a finite number of price trajectories. The motivation for this crossover is to utilize income segmentation in the population to restructure econometric information. Different segments of the population face the same prices with different budgets, and as much as aggregate data can testify on individual preferences, show how demand varies with the budget. Applying non parametric statistical methods, they reconstruct a trajectory from the observed demands of different segments and use it to obtain tighter bounds. Both these methods would most likely give a good forecast for a fixed demand function after sufficiently many observations assuming they were spread out in a reasonable manner. However, these methods do not consider the complexity of the demand functions and do not use any probabilistic model of the observations. Therefore, they are unable to provide any estimate of the number of observations that would be sufficient for a good forecast or the degree of confidence in such a forecast. In this paper we examine the feasibility of demand forecasting with a high degree of confidence using Afriat"s conditions. We formulate the question in terms of whether the class of demand functions derived from monotone concave utilities is efficiently PAC-learnable. Our first result is negative. We show, by computing the fat shattering dimension, that without any prior assumptions, the set of all demand functions induced by monotone concave utility functions is too rich to be efficiently PAC-learnable. However, under some prior assumptions on the set of demand functions we show that the fat shattering dimension is finite and therefore the corresponding sets are PAC-learnable. In these cases, assuming the probability distribution by which the observed price-demand pairs are generated is fixed, we are in a position to offer a forecast and a probabilistic estimate on its accuracy. In section 2 we briefly discuss the basic assumptions of demand theory and their implications. In section 3 we present a new proof to Afriat"s theorem incorporating an algorithm for efficiently generating a forecasting function due to Teo and Vohra [17]. We show that this algorithm is computationally efficient and can be used as a learning algorithm. In section 4 we give a brief introduction to PAC learning including several modifications to learning real vector valued functions. We introduce the notion of fat shattering dimension and use it to devise a lower bound on the sample complexity. We also sketch results on upper bounds. In section 5 we study the learnability of demand functions and directly compute the fat shattering dimension of the class of all demand functions and a class of income-Lipschitzian demand functions with a bounded global income-Lipschitz constant. 2. UTILITY AND DEMAND A utility function u : Rn + → R is a function relating bundles of goods to a cardinal in a manner reflecting the preferences over the bundles. A rational agent with a budget that w.l.g equals 1 facing a price vector p ∈ Rn + will choose from her budget set B(p) = {x ∈ Rn + : p · x ≤ 1} a bundle x ∈ Rn + that maximizes her private utility. The first assumption we make is that the function is monotone increasing, namely, if x ≥ y, in the sense that the inequality holds coordinatewise, then u(x) ≥ u(y). This reflects the assumption that agents will always prefer more of any one good. This, of course, does not necessarily hold in practice, as in many cases excess supply may lead to storage expenses or other externalities. However, in such cases the demand will be an interior point of the budget set and the less preferred bundles won"t be observed. The second assumption we make on the utility is that all the marginals (partial derivatives) are monotone decreasing. This is the law of diminishing marginal utility which assumes that the larger the excess of one good over the other the less we value each additional good of one kind over the other. These assumptions imply that the utility function is concave and monotone on the observations. The demand function of the agent is the correspondence fu : Rn + → Rn + satisfying f(p) = argmax{u(x) : p · x ≤ I} In general this correspondence is not necessarily single valued, but it is implicit in the proof of Afriat"s theorem that any set of observations can be rationalized by a demand function that is single valued for unobserved prices. 37 Since large quantities of any good are likely to create utility decreasing externalities, we assume the prices are limited to a compact set. W.l.g. we assume u has marginal utility zero outside [0, 1]d . Any budget set that is not a subset of the support is maximized on any point outside the support and it is therefore difficult to forecast for these prices. We are thus interested in forecasts for prices below the simplex ∆d = conv{(0, . . . , 1, . . . , 0)}. For these prices we take the metric dP (p, p ) = max{| 1 pi − 1 pi | : i = 1, . . . , d} for p, p ∈ ∆d. Note that with this metric ∆d is compact. A demand function is L-income-Lipschitz, for L ∈ R+, if ||f(p) − f(p )||∞ dP (p, p ) ≤ L for any p, p ∈ ∆d. This property reflects an assumption that preferences and demands have some sort of stability. It rules out different demands for the similar prices. We may therefore assume from here on that demand functions are single valued. 3. REVEALED PREFERENCE A sequence of prices and demands (p1, x1), . . . , (pn, xn) is rationalizable if there exists a utility function u such that xi = fu(pi) for i = 1, . . . , n. We begin with a trivial observation, if pi · xj ≤ pi · xi and xi = f(pi) then xi is preferred over xj since the latter is in the budget set when the former was chosen. It is therefore revealed that u(xj) ≤ u(xi) implying pj · xj ≤ pj · xi. Suppose there is a sequence (pi1 , xi1 ), . . . , (pik , xik ) such that pij · (xij − xij+1 ) ≤ 0 for j = 1 . . . k − 1 and pik · (xik − xi1 ) ≤ 0. Then the same reasoning shows that u(xi1 ) = u(xi2 ) = . . . = u(xik ) implying pi1 · (xi1 − xi2 ) = pi2 · (xi2 − xi3 ) = . . . = pik−1 ·(xik−1 −xik ) = 0. We call the latter condition the Afriat condition (AC). This argument shows that AC is necessary for rationalizability; the surprising result in Afriat"s theorem is that this condition is also sufficient. Let A be an n × n matrix with entries aij = pi · (xj − xi) (aij and aji are independent), aii = 0 and let D(A) be the weighted digraph associated with A. The matrix satisfies AC if every cycle with negative total weight includes at least one edge with positive weight. Theorem 1. There exists y = (y1, . . . , yn) ∈ Rn and s = (s1, . . . , sn) ∈ Rn + satisfying the set of inequalities L(A), yj ≤ yi + siaij i = j 1 ≤ i, j ≤ n iff D(A) satisfies AC. Proof : If L(A) is feasible then it is easy to see that u(x) = min i {yi + sipi(x − xi)} is a concave utility function that is consistent with the observations, and from our previous remark it follows that D(A) satisfies AC. In the other direction it is shown by explicit construction that Afriat"s condition for D(A) implies L(A) is feasible. The construction provides a utility function that is consistent with the observations. Teo and Vohra [17] give a strongly polynomial time algorithm for this construction which will be the heart of our learning algorithm. The construction is executed in two steps. First, the algorithm finds s ∈ Rn + such that the weighted digraph D(A, s) defined by the matrix ˜aij = siaij has no cycle with negative total weight if D(A) satisfies AC and returns a negative cycle otherwise. The dual of a shortest path problem is given by the constraints: yj − yi ≤ siaij i = j It is a standard result (see [14] p 109) that the system is feasible iff D(A, s) has no negative cycles. Thus, in the second step, if D(A) satisfies AC, the algorithm calls a SHORTEST PATH algorithm to find y ∈ Rn satisfying the constraints. Now we describe how to choose the si"s. Define S = {(i, j) : aij < 0}, E = {(i, j) : aij = 0} and T = {(i, j) : aij > 0} and let G = ([n], S ∪ E) be a digraph with weights wij = −1 if (i, j) ∈ S and wij = 0 otherwise. D(A) has no negative cycles, hence G is acyclic and breadth first search can assign potentials φi such that φj ≤ φi + wij for (i, j) ∈ S ∪ E. We relabel the vertices so that φ1 ≥ φ2 ≥ . . . ≥ φn. Let δi = (n − 1) max(i,j)∈S(−aij) min(i,j)∈T aij if φi < φi−1 and δi = 1 otherwise, and define si = iY j=2 δj = δi · si−1 . We show that for this choice of s, D(A, s) contains no negative weight cycle. Suppose C = (i1, . . . , ik) is a cycle in D(A, s). If φ is constant on C then aij ij+1 = 0 for j = 1, . . . , k and we are done. Otherwise let iv ∈ C be the vertex with smallest potential satisfying w.l.o.g. φ(iv) < φ(iv+1). For any cycle C in the digraph D(A, s), let (v, u) be an edge in C such that (i) v has the smallest potential among all vertices in C, and (ii) φu > φv. Such an edge exists, otherwise φi is identical for all vertices i in C. In this case, all edges in C have non-negative edge weight in D(A, s). If (iv, iv+1) ∈ S ∪ E, then we have φ(iv+1) ≤ φ(iv) + wiv,iv+1 ≤ φ(iv) a contradiction. Hence (iv, iv+1) ∈ T. Now, note that all vertices q in C with the same potential as iv must be incident to an edge (q, t) in C such that φ(t) ≥ φ(q). Hence the edge (q, t) must have non-negative weight. i.e., aq,t ≥ 0. Let p denote a vertex in C with the second smallest potential. Now, C has weight svavu+ X (k,l)∈C\(v,u) skak,l ≥ svav,u+sp(n−1) max (i,j)∈S {aij } ≥ 0, i.e., C has non-negative weight ✷ Algorithm 1 returns in polynomial time a hypothesis that is a piecewise linear function and agrees with the labeling of the observation namely sample error zero. To use this function to forecast demand for unobserved prices we need algorithm 2 which maximizes the function on a given budget set. Since u(x) = mini{yi + sipi(x − xi)} this is a linear program and can be solved in time polynomial in d, n as well as the size of the largest number in the input. 38 Algorithm 1 Utility Algorithm Input (x1, p1), . . . , (xn, pn) S ← {(i, j) : aij < 0} E ← {(i, j) : aij = 0} for all (i, j) ∈ S do wij ← −1 end for for all (i, j) ∈ E do wij ← 0 end for while there exist unvisited vertices do visit new vertex j assign potential to φj end while reorder indices so φ1 ≤ φ2 . . . ≤ φn for all 1 ≤ i ≤ n do δi ← (n − 1) max(i,j)∈S (−aij ) min(i,j)∈T aij si ← Qi j=2 δj end for SHORTEST PATH(yj − yi ≤ siaij) Return y1, . . . , yn ∈ Rd and s1, . . . , sn ∈ R+ Algorithm 2 Evaluation Input y1, . . . , yn ∈ Rd and s1, . . . , sn ∈ R+ max z z ≤ yi + sipi(x − xi) for i = 1, . . . , n px ≤ 1 Return x for which z is maximized 4. SUPERVISED LEARNING In a supervised learning problem, a learning algorithm is given a finite sample of labeled observations as input and is required to return a model of the functional relationship underlying the labeling. This model, referred to as a hypothesis, is usually a computable function that is used to forecast the labels of future observations. The labels are usually binary values indicating the membership of the observed points in the set that is being learned. However, we are not limited to binary values and, indeed, in the demand functions we are studying the labels are real vectors. The learning problem has three major components: estimation, approximation and complexity. The estimation problem is concerned with the tradeoff between the size of the sample given to the algorithm and the degree of confidence we have in the forecast it produces. The approximation problem is concerned with the ability of hypotheses from a certain class to approximate target functions from a possibly different class. The complexity problem is concerned with the computational complexity of finding a hypothesis that approximates the target function. A parametric paradigm assumes that the underlying functional relationship comes from a well defined family, such as the Cobb-Douglas production functions; the system must learn the parameters characterizing this family. Suppose that a learning algorithm observes a finite set of production data which it assumes comes from a Cobb-Douglas production function and returns a hypothesis that is a polynomial of bounded degree. The estimation problem in this case would be to assess the sample size needed to obtain a good estimate of the coefficients. The approximation problem would be to assess the error sustained from approximating a rational function by a polynomial. The complexity problem would be the assessment of the time required to compute the polynomial coefficients. In the probably approximately correct (PAC) paradigm, the learning of a target function is done by a class of hypothesis functions, that does or does not include the target function itself; it does not necessitate any parametric assumptions on this class. It is also assumed that the observations are generated independently by some distribution on the domain of the relation and that this distribution is fixed. If the class of target functions has finite "dimensionality" then a function in the class is characterized by its values on a finite number of points. The basic idea is to observe the labeling of a finite number of points and find a function from a class of hypotheses which tends to agree with this labeling. The theory tells us that if the sample is large enough then any function that tends to agree with the labeling will, with high probability, be a good approximation of the target function for future observations. The prime objective of PAC theory is to develop the relevant notion of dimensionality and to formalize the tradeoff between dimensionality, sample size and the level of confidence in the forecast. In the revealed preference setting, our objective is to use a set of observations of prices and demand to forecast demand for unobserved prices. Thus the target function is a mapping from prices to bundles, namely f : Rd + → Rd +. The theory of PAC learning for real valued functions is concerned predominantly with functions from Rd to R. In this section we introduce modifications to the classical notions of PAC learning to vector valued functions and use them to prove a lower bound for sample complexity. An upper bound on the sample complexity can also be proved for our definition of fat shattering, but we do not bring it here as the proof is much more tedious and analogous to the proof of theorem 4. Before we can proceed with the formal definition, we must clarify what we mean by forecast and tend to agree. In the case of discrete learning, we would like to obtain a function h that with high probability agrees with f. We would then take the probability Pσ(f(x) = h(x)) as the measure of the quality of the estimation. Demand functions are real vector functions and we therefore do not expect f and h to agree with high probability. Rather we are content with having small mean square errors on all coordinates. Thus, our measure of estimation error is given by: erσ(f, h) = Z (||f − h||∞)2 dσ. For given observations S = {(p1, x1), . . . , (pn, xn)} we measure the agreement by the sample error erS(S, h) = X j (||xj − h(pj)||∞)2 . A sample error minimization (SEM) algorithm is an algorithm that finds a hypothesis minimizing erS(S, h). In the case of revealed preference, there is a function that takes the sample error to zero. Nevertheless, the upper bounds theorem we use does not require the sample error to be zero. Definition 1. A set of demand functions C is probably approximately correct (PAC) learnable by hypotheses set H if for any ε, δ > 0, f ∈ C and distribution σ on the prices 39 there exists an algorithm L that for a set of observations of length mL = mL(ε, δ) = Poly(1 δ , 1 ε ) finds a function h from H such that erσ(f, h) < ε with probability 1 − δ. There may be several learning algorithms for C with different sample complexities. The minimal mL is called the sample complexity of C. Note that in the definition there is no mention of the time complexity to find h in H and evaluating h(p). A set C is efficiently PAC-learnable if there is a Poly(1 δ , 1 ε ) time algorithm for choosing h and evaluating h(p). For discrete function sets, sample complexity bounds may be derived from the VC-dimension of the set (see [19, 8]). An analog to this notion of dimension for real functions is the fat shattering dimension. We use an adaptation of this notion to real vector valued function sets. Let Γ ⊂ Rd + and let C be a set of real functions from Γ to Rd +. Definition 2. For γ > 0, a set of points p1, . . . , pn ∈ Γ is γ-shattered by a class of real functions C if there exists x1, . . . , xn ∈ Rd + and parallel affine hyperplanes H0, H1 ⊂ Rd such that 0 ∈ H− 0 ∩ H+ 1 , dist(H0, H1) > γ and for each b = (b1, . . . , bn) ∈ {0, 1}n there exists a function fb ∈ C such that fb(pi) ∈ xi + H+ 0 if bi = 0 and f(pi) ∈ xi + H− 1 if bi = 1. We define the γ-fat shattering dimension of C, denoted fatC(γ) as the maximal size of a γ-shattered set in Γ. If this size is unbounded then the dimension is infinite. To demonstrate the usefulness of the this notion we use it to derive a lower bound on the sample complexity. Lemma 2. Suppose the functions {fb : b ∈ {0, 1}n } witness the shattering of {p1, . . . , pn}. Then, for any x ∈ Rd + and labels b, b ∈ {0, 1}n such that bi = bi either ||fb(pi) − x||∞ > γ 2d or ||fb (pi) − x||∞ > γ 2d . Proof : Since the max exceeds the mean, it follows that if fb and fb correspond to labels such that bi = bi then ||fb(pi) − fb (pi)||∞ ≥ 1 d ||fb(pi) − fb (pi)||2 > γ d . This implies that for any x ∈ Rd + either ||fb(pi) − x||∞ > γ 2d or ||fb (pi) − x||∞ > γ 2d ✷ Theorem 3. Suppose that C is a class of functions mapping from Γ to Rd +. Then any learning algorithm L for C has sample complexity satisfying mL(ε, δ) ≥ 1 2 fatC(4dε) An analog of this theorem for real valued functions with a tighter bound can be found in [2], this version will suffice for our needs. Proof : Suppose n = 1 2 fatC(4dε) then there exists a set ΓS = {p1, . . . , p2n} that is shattered by C. It suffices to show that at least one distribution requires large sample. We construct such a distribution. Let σ be the uniform distribution on ΓS and CS = {fb : b ∈ {0, 1}2n } be the set of functions that witness the shattering of {p1. . . . , pn}. Let fb be a function chosen uniformly at random from CS. It follows from lemma 2 (with γ = 2d ) that for any fixed function h the probability that ||fb(p) − h(p)||∞ > 2ε for p ∈ ΓS is at least as high as getting heads on a fair coin toss. Therefore Eb(||fb(p) − h(p)||∞) > 2ε. Suppose for a sequence of observations z = ((pi1 , x1), . . . , (pin , xn)) a learning algorithm L finds a function h. The observation above and Fubini imply Eb(erσ(h, fb)) > ε. Randomizing on the sample space we get Eb,z(erσ(h, fb)) > ε. This shows Eh,z(erσ(h, fb0 )) > ε for some fb0 . W.l.g we may assume the error is bounded (since we are looking at what is essentially a finite set) therefore the probability that erσ(h, fb0 ) > ε cannot be too small, hence fb0 is not PAClearnable with a sample of size n ✷ The following theorem gives an upper bound on the sample complexity required for learning a set of functions with finite fat shattering dimension. The theorem is proved in [2] for real valued functions, the proof for the real vector case is analogous and so omitted. Theorem 4. Let C be a set of real-valued functions from X to [0, 1] with fatC(γ) < ∞. Let A be approximate-SEM algorithm for C and define L(z) = A(z, ε0 6 ) for z ∈ Zm and ε0 = 16√ m . Then L is a learning algorithm for C with sample complexity given by: mL(ε, δ) = O „ 1 ε2 (ln2 ( 1 ε )fatC(ε) + ln( 1 δ )) « for any ε, δ > 0. 5. LEARNING FROM REVEALED PREFERENCE Algorithm 1 is an efficient learning algorithm in the sense that it finds a hypothesis with sample error zero in time polynomial in the number of observations. As we have seen in section 4 the number of observations required to PAC learn the demand depends on the fat shattering dimension of the class of demand functions which in turn depends on the class of utility functions they are derived from. We compute the fat shattering dimension for two classes of demands. The first is the class of all demand functions, we show that this class has infinite shattering dimension (we give two proofs) and is therefore not PAC learnable. The second class we consider is the class of demand functions derived from utilities with bounded support and income-Lipschitz. We show that the class has a finite fat shattering dimension that depends on the support and the income-Lipschitz constant. Theorem 5. Let C be a set of demand functions from Rd + to Rd + then fatC(γ) = ∞ Proof 1: For ε > 0 let pi = 2−i p for i = 1, . . . , n be a set of price vectors inducing parallel budget sets Bi and let x1, . . . , xn be the intersection of these hyperplanes with an orthogonal line passing through the center. Let H0 and H1 be hyperplanes that are not parallel to p and let xi ∈ Bi ∩ (xi + H+ 0 ) and xi ∈ Bi ∩ (xi + H− 1 ) for i = 1 . . . n (see figure 1). For any labeling b = (b1, . . . , bn) ∈ {0, 1}n let y = y(b) = (y1, . . . , yn) be a set of demands such that yi = xi if bi = 0 and yi = xi if bi = 1 (we omit an additional index b in y for notational convenience). To show that p1, . . . , pn is shattered it suffices to find for every b a demand function fb supported by concave utility such that fb(pi) = yb i . To show that such a function exists it suffices to show that Afriat"s conditions are satisfied. Since yi are in the budget 40 set yi · 2−i p = 1, therefore pi · (yj − yi) = 2j−i − 1. This shows that pi · (yj − yi) ≤ 0 iff j < i hence there can be no negative cycles and the condition is met. ✷ Proof 2: The utility functions satisfying Afriat"s condition in the first proof could be trivial assigning the same utility to xi as to xi . In fact, pick a utility function whose level sets are parallel to the budget constraint. Therefore the shattering of the prices p1, . . . , pn is the result of indifference rather than genuine preference. To avoid this problem we reprove the theorem by constructing utility functions u such that u(xi) = u(xi ) for all i and therefore a distinct utility function is associated with each labeling. For i = 1, . . . n let pi1, . . . , pid be price vectors satisfying the following conditions: 1. the budget sets Bs i are supporting hyperplanes of a convex polytope Λi 2. yi is a vertex of Λi 3. ||yj ||1 · ||pis − pi||∞ = o(1) for s = 1, . . . d and j = 1, . . . , n Finally let yi1, . . . , yid be points on the facets of Λi that intersect yi, such that ||pjr||1 · ||yi − yis||∞ = o(1) for all j, s and r. We call the set of points yi, yi1, . . . , yid the level i demand and pi, pi1, . . . , pid level i prices. Applying H¨olders inequality we get |pir · yjs − pi · yj | ≤ |(pir − pi) · yj| + |pir · (yjs − yj)| ||pir − pi||∞ · ||yj ||1 + ||yjs − yj||∞ · ||pir||1. = o(1) This shows that pir · (yjs − yir) = pi · (ys − yi) + o(1) = 2j−i − 1 + o(1) therefore pir · (yjs − yir) ≤ 0 iff j < i or i = j. This implies that if there is a negative cycle then all the points in the cycle must belong to the same level. The points of any one level lie on the facets of a polytope Λi and the prices pis are supporting hyperplanes of the polytope. Thus, the polytope defines a utility function for which these demands are utility maximizing. The other direction of Afriat"s theorem therefore implies there can be no negative cycles within points on the same level. It follows that there are no negative cycles for the union of observations from all levels hence the sequence of observations (y1, p1), (y11, p11), (y12, p12), . . . , (ynd, pnd) is consistent with monotone concave utility function maximization and again by Afriat"s theorem there exists u supporting a demand function fb ✷ The proof above relies on the fact that an agent have high utility and marginal utility for very large bundles. In many cases it is reasonable to assume that the marginal for very large bundles is very small, or even that the utility or the marginal utility have compact support. Unfortunately, rescaling the previous example shows that even a compact set may contain a large shattered set. We notice however, that in this case we obtain a utility function that yield demand functions that are very sensitive to small price changes. We show that the class of utility functions that have marginal utilities with compact support and for which the relevant demand functions are income-Lipschitzian has finite fat shattering dimension. ✲ ✻ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ (0,0)               H0               H1r x1 r x1 ❈ ❈ ❜ ❜❜ r x2 r x2 ❚ ❚ ❚ ❚ ❚ ❚ ❚ ❚ ❚ ❚ ❚ ❜ ❜❜ Figure 1: Utility function shattering x1 and x2 Theorem 6. Let C be a set of L-income-Lipschitz demand functions from ∆d to Rd + for some global constant L ∈ R. Then fatC(γ) ≤ ( L γ )d Proof : Let p1, . . . , pn ∈ ∆d be a shattered set with witnesses x1, . . . , xn ∈ Rd +. W.l.g. xi+H+ 0 ∩xj +H− 0 = ∅ implying xi + H− 1 ∩ xj + H+ 1 = ∅, for a labeling b = (b1, . . . , bn) ∈ {0, 1}n such that bi = 0 and bj = 1, ||fb(pi) − fb(pj)||∞ > γ hence ||pi − pj||∞ > γ L . A standard packing argument implies n ≤ (L γ )d ✷ 6. ACKNOWLEDGMENTS The authors would like to thank Eli Shamir, Ehud Kalai, Julio Gonz´alez D´ıaz, Rosa Matzkin, Gad Allon and Adam Galambos for helpful discussions and suggestions. 7. REFERENCES [1] Afriat S. N. (1967) The Construction of a Utility Function from Expenditure Data International Economic Review 8, 67-77. [2] Anthony M. and Bartlett P. L. (1999) Neural Network Learning: Theoretical Foundations Cambridge University Press. [3] Blundell R., Browning M. and Crawford I. (2003) Nonparametric Engel curves and revealed preference. Econometrica, 71(1):205-240. [4] Blundell R. (2005 ) How revealing is revealed preference? European Economic Journal 3, 211 - 235. [5] Diewert E. (1973) Afriat and Revealed Preference Theory Review of Economic Studies 40, 419 - 426. [6] Farkas J. (1902) ¨Uber die Theorie der Einfachen Ungleichungen Journal f¨ur die Reine und Angewandte Mathematik 124 1-27 [7] Houthakker H. (1950) Revealed Preference and the Utility Function Economica 17, 159 - 174. [8] Kearns M. and Vazirani U. (1994) An Introduction to Computational Learning Theory The MIT Press Cambridge MA. 41 [9] Knoblauch V. (1992) A Tight Upper Bound on the Money Metric Utility Function. The American Economic Review, 82(3):660-663. [10] Mas-Colell A. (1977) The Recoverability of Consumers" Preferences from Market Demand. Econometrica, 45(6):1409-1430. [11] Mas-Colell A. (1978) On Revealed Preference Analysis. The Review of Economic Studies, 45(1):121-131. [12] Mas-Colell A., Whinston M. and Green J. R. (1995) Microeconomic Theory Oxford University Press. [13] Matzkin R. and Richter M. (1991) Testing Strictly Concave Rationality. Journal of Economic Theory, 53:287-303. [14] Papadimitriou C. H. and Steiglitz K. (1982) Combinatorial Optimization Dover Publications inc. [15] Richter M. (1966) Revealed Preference Theory. Econometrica, 34(3):635-645. [16] Uzawa H. (1960 ) Preference and rational choice in the theory of consumption. In K. J. Arrow, S. Karlin, and P. Suppes, editors, Mathematical Models in Social Science Stanford University Press, Stanford, CA. [17] Teo C. P. and Vohra R. V. (2003) Afriat"s Theorem and Negative Cycles Working Paper [18] Samuelson P. A. (1948) Consumption Theory in Terms of Revealed Preference Economica 15, 243 - 253. [19] Vapnik V. N. (1998) Statistical Learning Theory John Wiley & Sons Inc. [20] Varian H. R. (1982) The Non-Parametric Approach to Demand Analysis Econometrica 50, 945 - 974. [21] Varian H. R. (2005) Revealed Preference, In Michael Szenberg editor, Samuelson Economics and the 21st Century. [22] Ziegler G. M. (1994) Lectures on Polytopes Springer. 42