{"title": "Data center cooling using model-predictive control", "book": "Advances in Neural Information Processing Systems", "page_first": 3814, "page_last": 3823, "abstract": "Despite impressive recent advances in reinforcement learning (RL), its deployment in real-world physical systems is often complicated by unexpected events, limited data, and the potential for expensive failures. In this paper, we describe an application of RL \u201cin the wild\u201d to the task of regulating temperatures and airflow inside a large-scale data center (DC). Adopting a data-driven, model-based approach, we demonstrate that an RL agent with little prior knowledge is able to effectively and safely regulate conditions on a server floor after just a few hours of exploration, while improving operational efficiency relative to existing PID controllers.", "full_text": "Data center cooling using model-predictive control\n\nNevena Lazic, Tyler Lu, Craig Boutilier, Moonkyung Ryu\n\nGoogle Research\n\n{nevena, tylerlu, cboutilier, mkryu}@google.com\n\nEehern Wong, Binz Roy, Greg Imwalle\n\nGoogle Cloud\n\n{ejwong, binzroy, gregi}@google.com\n\nAbstract\n\nDespite the impressive recent advances in reinforcement learning (RL) algorithms,\ntheir deployment to real-world physical systems is often complicated by unexpected\nevents, limited data, and the potential for expensive failures. In this paper, we\ndescribe an application of RL \u201cin the wild\u201d to the task of regulating temperatures\nand air\ufb02ow inside a large-scale data center (DC). 
Adopting a data-driven, model-\nbased approach, we demonstrate that an RL agent with little prior knowledge is\nable to effectively and safely regulate conditions on a server \ufb02oor after just a few\nhours of exploration, while improving operational ef\ufb01ciency relative to existing\nPID controllers.\n\n1\n\nIntroduction\n\nRecent years have seen considerable research advances in reinforcement learning (RL), with algo-\nrithms achieving impressive performance on game playing and simple robotic tasks [24, 29, 27].\nHowever, applying RL to the control of real-world physical systems is complicated by unexpected\nevents, safety constraints, limited observations and the potential for expensive or even catastrophic\nfailures. In this paper, we describe an application of RL to the task of data center (DC) cooling.\nDC cooling is a test bed that is well-suited for RL deployment because it involves control of a com-\nplex, large-scale dynamical system, non-trivial safety constraints and the potential for considerable\nimprovements in energy ef\ufb01ciency.\nCooling is a critical part of DC infrastructure, since multiple servers operating in close proximity\nproduce a considerable amount of heat and high temperatures may lead to lower IT performance\nor equipment damage. There has been considerable progress in improving cooling ef\ufb01ciency, and\nbest-practice physical designs are now commonplace in large-scale DCs [7]. However, on the\nsoftware side, designing resource-ef\ufb01cient control strategies is still quite challenging, due to complex\ninteractions between multiple non-linear mechanical and electrical systems. Most existing controllers\ntend to be fairly simple, somewhat conservative, and hand-tuned to speci\ufb01c equipment architectures,\nlayouts, and con\ufb01gurations. 
This leaves potential for ef\ufb01ciency improvement and automation using\nmore adaptive, data-driven techniques.\nAs the number of DCs increases with the adoption of cloud-based services, data growth, and hardware\naffordability, power management is becoming an important challenge in scaling up. In 2014, DCs\naccounted for about 1.8% of the total power usage in the U.S. and about 626 billion liters of water\nwere consumed by DC operations [28]. There has been increased pressure to improve operational\nef\ufb01ciency due to rising energy costs and environmental concerns. This includes cooling, which\nconstitutes a non-trivial part of the DC power overhead.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fRecently, DeepMind demonstrated that it is possible to improve DC power usage ef\ufb01ciency (PUE)\nusing a machine learning approach [13]. In particular, they developed a predictive model of PUE in a\nlarge-scale Google DC, and demonstrated that it can be improved by manipulating the temperature of\nthe water leaving the cooling tower and chilled water injection setpoints. In this work, we focus on a\ncomplementary aspect of DC cooling: regulating the temperature and air\ufb02ow inside server \ufb02oors by\ncontrolling fan speeds and water \ufb02ow within air handling units (AHUs).\nOur approach to cooling relies on model-predictive control (MPC). Speci\ufb01cally, we learn a linear\nmodel of the DC dynamics using safe, random exploration, starting with little or no prior knowledge.\nWe subsequently recommend control actions at each time point by optimizing the cost of model-\npredicted trajectories. Rather than executing entire trajectories, we re-optimize at each time step. 
The resulting system is simple to deploy, as it does not require historical data or a physics-based model.
The main contribution of the paper is to demonstrate that a controller relying on a coarse-grained linear dynamics model can safely, effectively, and cost-efficiently regulate conditions in a large-scale commercial data center, after just a few hours of learning and with minimal prior knowledge. By contrast, characterizing and configuring the cooling control of a new data center floor typically takes weeks of setup and testing using existing techniques.

2 Background and related work

Among approaches in the literature, the most relevant to our problem is linear quadratic (LQ) control. Here it is assumed that system dynamics are linear and the cost is a quadratic function of states and controls. When the dynamics are known, the optimal policy is given by constant linear state feedback and can be computed efficiently using dynamic programming. In the case of unknown dynamics, open-loop strategies identify the system (i.e., learn the parameters of a dynamics model) in a dedicated exploration phase, while closed-loop strategies control from the outset, updating models along the way [20].
The simplest closed-loop approach, known as certainty equivalence, updates the parameters of the dynamics model at each step and applies the control law as if the estimated model were the ground truth. This strategy is unable to identify the system in general: parameters may not converge, or may converge to the wrong model, leading to strictly suboptimal control [6]. More recent approaches [8, 2, 17] use optimism in the face of uncertainty, where at each iteration the algorithm selects the dynamics with lowest attainable cost from some confidence set. While optimistic control is asymptotically optimal [8] and has a finite-time regret bound of O(√T) [2], it is highly impractical, as finding the lowest-cost dynamics is computationally intractable. Similar regret bounds can be derived using Thompson sampling in place of optimization [3, 4, 25], but most of these approaches make unrealistic stability assumptions about the intermediate controllers, and can in practice induce diverging state trajectories in early stages.
In the open-loop setting, critical issues include the design of exploratory inputs and estimation-error analysis. Asymptotic results in linear system identification (see [21]) include one simple requirement on the control sequence, persistence of excitation [5]. A review of frequency-domain identification methods is given in [10], while identification of auto-regressive time series models is covered in [9]. Non-asymptotic results are limited and often require additional stability assumptions [16, 14]; most recently, Dean et al. [11] have related the estimation error to the smallest eigenvalue of the finite-time controllability Gramian.
In the presence of constraints on controls or states, the optimal LQ controller is no longer given by linear feedback, and it is usually simpler to directly optimize control variables. In model-predictive control, the controller generates actions at each step by optimizing the cost of a model-predicted trajectory. Re-optimizing at each time step mitigates the impact of model error and unexpected disturbances at the expense of additional computation. MPC has previously been used to regulate building cooling [18, 22, 23, 13, 12], with most approaches relying on historical data and physics-based models. 
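The LQ baseline discussed above, in which the optimal policy is constant linear state feedback computed by dynamic programming, can be sketched as a backward Riccati recursion. This is an illustration of the background material only (the paper's own controller is the constrained MPC of Section 4); the example system in the usage below is a made-up double integrator.

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, horizon):
    """Backward Riccati recursion for finite-horizon LQ control.

    Returns feedback gains K[t] such that the optimal action is u[t] = -K[t] @ x[t].
    """
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        # K_t = (R + B'PB)^{-1} B'PA, then propagate the cost-to-go matrix P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[0] corresponds to the first time step
```

For a stabilizable pair (A, B), simulating x[t+1] = (A - B K[t]) x[t] with these gains drives the state toward the origin.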
In the context of DC cooling, MPC has been used to control adaptive-vent \ufb02oor tiles in\naddition to air-conditioning units, with system identi\ufb01cation performed via random exploration [30].\nIn this work, we develop a similar control strategy that relies on open-loop linear system identi\ufb01cation,\nfollowed by MPC. We demonstrate that our system can successfully control temperatures and air\ufb02ow\nin a large-scale DC after only a few hours of safe, randomized exploration.\n\n2\n\n\fFigure 1: Data center cooling loop. AHUs on the server \ufb02oor regulate the air temperature through\nair-water heat exchange. Warmed water is cooled in the chiller and evaporative cooling towers.\n\n3 Data center cooling\n\nFigure 1 shows a schematic of the cooling loop of a typical DC. Water is cooled to sub-ambient\ntemperatures in the chiller and evaporative cooling towers, and then sent to multiple air handling\nunits (AHUs) on the server \ufb02oor. Server racks are arranged in rows between alternating hot and cold\naisles. All hot air exhausts into the adjacent hot aisles, which are typically isolated using a physical\nbarrier to prevent hot and cold air from mixing. The AHUs circulate air through the building; hot air\nis cooled through air-water heat exchange in the AHUs, and blown into the cold aisle. The generated\nwarm water is sent back to the chiller and cooling towers. Naturally, variations of this setup exist.\nOur focus is on \ufb02oor-level cooling, where the primary goal is to regulate cold-aisle temperatures and\ndifferential air pressures. Controlling the cold-aisle temperatures ensures that the machines operate at\noptimal ef\ufb01ciency and prevents equipment damage. Maintaining negative differential air pressure\nbetween adjacent hot and cold aisles ensures that cool air \ufb02ows over servers and improves power\nef\ufb01ciency by minimizing the need for the servers to use their own fans. 
Our goal is to operate close\nto (but not exceeding) upper bounds on temperature and pressure at minimal AHU power and water\nusage. Variables relevant to this problem are continuous-valued, and can be grouped as follows:\n\n\u2022 Controls are the variables we can manipulate. These are fan speed (controlling air \ufb02ow) and\n\nvalve opening (which regulates amount of water used) for each AHU.\n\n\u2022 States collect the process variables we wish to predict and regulate. These include differential\nair pressure (DP) and cold-aisle temperature (CAT), measured using multiple sensors along\nthe server racks. To reduce redundancy and increase robustness to failed sensors, we model\nand regulate the median values of local groups of CAT and DP sensors. We also measure\nthe entering air temperature (EAT) of the hot air entering each AHU, and leaving air\ntemperature (LAT) of the cooled air leaving each AHU.\n\n\u2022 Disturbances are the events and conditions which we cannot manipulate or control, but\nwhich nonetheless affect the conditions inside the server \ufb02oor. These include server power\nusage, which serves as a proxy for the amount of generated heat, as well as the entering\nwater temperature (EWT) of the chilled water measured at each AHU.\n\nAn illustrative schematic of the structure of the DC used in our case study is shown in Figure 1.\nThe system consists of many dozens of AHUs, with two controls each, and many dozens of state\nvariables for each row. The existing cooling system relies on local PID controllers (one per AHU),\nwhich are manually tuned and regulate DP measured at nearby sensors and LAT measured at the\nsame AHU. Directly controlling CAT (the variable of interest) instead of LAT is more complicated,\nas temperatures along the server racks take a longer time to respond to changes in controls and\ndepend on multiple AHUs. 
Since the local controllers operate independently, they may settle into a suboptimal state where some AHUs do little work while others run at their maximum capacity to compensate. This is addressed using a supervisory software layer which heuristically readjusts local controls to operate in a more balanced state.

Figure 2: Model structure illustration. Sensor measurements at each location only depend on the closest AHUs. The regularity of the DC layout allows parameters to be shared between local models with the same structure (arrows with the same color share weights).

Table 1: State variable dependencies

Variable | Predictors
DP | DP measurements and fan speeds in up to 5 closest aisles / 10 closest AHUs
LAT | LAT, EWT, EAT, fan speed, and valve position at the closest AHU
CAT | CAT, LAT, and fan speeds in up to 3 closest aisles / 6 closest AHUs
EAT | EAT, CAT, fan speeds, and power usage at up to 3 closest aisles / 6 closest AHUs

4 Model predictive control

We consider the use of MPC to remove some of the inefficiencies associated with the existing PID control system. We: (i) model the effect of each AHU on state variables in a large neighborhood (up to 5 server rows) rather than on just the closest sensors; (ii) control CAT directly rather than using LAT as a proxy; and (iii) jointly optimize all controls instead of using independent local controllers. We identify a model of DC cooling dynamics using only a few hours of exploration and minimal prior knowledge. We then control using this learned model, removing the need for manual tuning. As we show, these changes allow us to operate at less conservative setpoints and improve the cooling operational efficiency.

4.1 Model structure

Let x[t], u[t], and d[t] be the vectors of all state, control, and disturbance variables at time t, respectively. 
We model data center dynamics using a linear auto-regressive model with exogenous variables (or ARX model) of the following form:

x[t] = Σ_{k=1}^{T} A_k x[t−k] + Σ_{k=1}^{T} B_k u[t−k] + C d[t−1],   (1)

where A_k, B_k, and C are parameter matrices of appropriate dimensions. Since we treat sensor observations as state variables, our model is T-Markov to capture relevant history and underlying latent state. Each time step corresponds to a period of 30s, and we set T = 5 based on cross-validation. While the true dynamics are not linear, we will see that a linear approximation in the relevant region of state-action space suffices for effective control.
We use prior knowledge about the DC layout to impose a block diagonal-like sparsity on the learned parameter matrices. The large size of the server floor allows us to assume that temperatures and DPs at each location directly depend only on states, controls, and disturbances at nearby locations (i.e., are conditionally independent of more distant sensors and AHUs given the nearby values).1 Additional parameter sparsity can be imposed based on variable types; for example, we know that DP directly depends on the fan speeds, but is (roughly) independent of temperature within a narrow temperature range.

1In other words, the nearby sensors and controls form a Markov blanket [26] for specific variables in a graphical model of the dynamical system.

Figure 3: An example run of random exploration, followed by control. The figure shows valve commands and fan speeds for all AHUs, as well as the CAT and DP sensor values at multiple locations throughout the DC. The system controls DP at a setpoint x_sp^DP. CAT control starts at setpoint x_sp^CAT − 1, followed by x_sp^CAT; the temperatures transition between the two values quickly and with little overshoot.
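The one-step ARX prediction of Eq. (1) can be sketched as follows. The dimensions and randomly drawn parameter matrices below are purely illustrative, not the paper's actual sensor counts or learned weights.

```python
import numpy as np

# Toy dimensions for illustration: T = 5 lags, 8 state, 4 control, 2 disturbance variables.
T_LAG, N_STATE, N_CTRL, N_DIST = 5, 8, 4, 2
rng = np.random.default_rng(0)
A = [0.1 * rng.standard_normal((N_STATE, N_STATE)) for _ in range(T_LAG)]
B = [0.1 * rng.standard_normal((N_STATE, N_CTRL)) for _ in range(T_LAG)]
C = 0.1 * rng.standard_normal((N_STATE, N_DIST))

def predict_next(x_hist, u_hist, d_prev):
    """Eq. (1): x[t] = sum_k A_k x[t-k] + sum_k B_k u[t-k] + C d[t-1].

    x_hist[-1] is x[t-1], u_hist[-1] is u[t-1]; each history holds T_LAG vectors.
    """
    x_next = C @ d_prev
    for k in range(1, T_LAG + 1):
        x_next = x_next + A[k - 1] @ x_hist[-k] + B[k - 1] @ u_hist[-k]
    return x_next
```

In the paper the A_k and B_k additionally carry the block-sparse, parameter-shared structure of Figure 2; here they are dense for brevity.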
We list the features used to predict each state variable in Table 1.
Since the servers, sensors, and cooling hardware are arranged in a regular physical layout in the DC we work with, we share parameters between local models for sample efficiency. Thus, our model has an overall linear convolutional structure, as illustrated in Figure 2.

4.2 System identification

We learn the system dynamics using randomized exploration over controls, starting with a "vacuous" model that predicts no change in states. While we have access to historical data generated by the local PID controllers, it is not sufficiently rich to allow for system identification due to the steady-state behaviour of the controllers.2 During the control phase we continue to update the dynamics in an online or batch-online fashion.
As safe operation during exploration is critical, we limit each control variable to a safe range informed by historical data. In the absence of such data, the safe range can be initialized conservatively and gradually expanded. We also limit the maximum absolute changes in fan and valve controls between consecutive time steps, since large changes may degrade hardware over time. Let u_i^c[t] indicate the value of the control variable c for the ith AHU at time step t, with c ∈ {fan, valve}. Let [u_min^c, u_max^c] be the range of control variable c, and let Δ_c be the maximum allowed absolute change in c between consecutive time steps. Our exploration strategy is a simple, range-limited uniform random walk in each control variable:

u_i^c[t+1] = max(u_min^c, min(u_max^c, u_i^c[t] + v_i^c)),   v_i^c ∼ Uniform(−Δ_c, Δ_c).   (2)

This strategy ensures sufficient frequency content for system identification and respects safety and hardware constraints. 
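The exploration rule of Eq. (2) can be sketched in a few lines. The range and step limit below are made-up placeholders; in practice they would come from historical data and hardware limits, as described above.

```python
import numpy as np

U_MIN, U_MAX = 30.0, 80.0  # illustrative safe range, e.g. fan speed in %
DELTA = 2.0                # illustrative max absolute change per 30s step

def explore_step(u, rng):
    """One step of the range-limited uniform random walk of Eq. (2).

    u is a vector of controls (one entry per AHU); the perturbation is drawn
    independently per AHU and the result is clipped back into the safe range.
    """
    v = rng.uniform(-DELTA, DELTA, size=u.shape)
    return np.clip(u + v, U_MIN, U_MAX)
```

Because clipping only ever moves a value toward the safe range (which contains the previous control), each step both stays within [U_MIN, U_MAX] and changes by at most DELTA.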
Figure 3 shows controls and states during an example run of random exploration, followed by control.
During the exploration phase, we update model parameters using recursive least squares [15]. In the control phase, we update parameters selectively so as not to overwhelm the model with steady-state data. In particular, we estimate the noise standard deviation σ_s for each variable s as the root mean squared error on the training data, and update the model with an example only if its (current) prediction error exceeds 2σ_s.3

2Specifically, the PID controllers operate in too narrow a range of (joint) state-control space to generate data allowing sufficiently accurate prediction in novel regions.

3In long-running operation, triggering further exploration to account for rare exogenous conditions or disturbances (as well as drift) may be necessary, but we don't consider this here.

4.3 Control

Given our model and an initial condition (the T past states, controls, and disturbances for the M AHUs), we optimize the cost of a length-L trajectory with respect to control inputs at every step. Let x_sp^s denote the setpoint (or target value) for a state variable s, where s ∈ {DP, CAT, LAT}. Let x_i^s[t] denote the value of the state variable s for the ith AHU at time t. We set controls by solving the following optimization problem:

min_u Σ_{τ=t}^{t+L} Σ_{i=1}^{M} ( Σ_s q_s (x_i^s[τ] − x_sp^s)² + Σ_c r_c (u_i^c[τ] − u_min^c)² )   (3)

s.t. u_i^c[τ] ∈ [u_min^c, u_max^c],  |u_i^c[τ] − u_i^c[τ−1]| ≤ Δ_c,  d[τ] = d[τ−1],   (4)

x[τ] = Σ_{k=1}^{T} A_k x[τ−k] + Σ_{k=1}^{T} B_k u[τ−k] + C d[τ−1],   (5)

t ≤ τ ≤ t + L,  c ∈ {fan, valve},  s ∈ {DP, CAT, LAT}.   (6)

Here q_s and r_c are the weights for the loss w.r.t. state and control variables s and c, respectively, and i ranges over AHUs. We assume that disturbances do not change over time. Overall, we have a quadratic optimization objective in 2ML ≈ 1.2K variables, with a large number of linear and range constraints. While we optimize over the entire trajectory, we only execute the optimized control action at the first time step. Re-optimizing at each step enables us to react to changes in disturbances and compensate for model error.
We specify the above objective as a computation graph in TensorFlow [1] and optimize controls using the Adam [19] algorithm. In particular, we implement constraints by specifying controls as

u_i^c[τ] = max(u_min^c, min(u_max^c, u_i^c[τ−1] + Δ_c tanh(z_i^c[τ]))),   (7)

where z_i^c[τ] is an unconstrained optimization variable. The main motivation for this choice is its simplicity and speed: the optimization converges well before our re-optimization period of 30s.

5 Experiments

We evaluate the performance of our MPC approach w.r.t. the existing local PID method on a large-scale DC. Since the quality of MPC depends critically on the quality of the underlying model, we first compare our system identification strategy to two simple alternatives. 
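Before turning to the experimental comparison, the constraint reparameterization of Eq. (7), which the evaluated MPC controller uses to keep optimized controls feasible, can be sketched as follows. This is a plain-numpy stand-in for the TensorFlow graph described above, and the range and step limit are illustrative.

```python
import numpy as np

U_MIN, U_MAX, DELTA = 30.0, 80.0, 2.0  # illustrative range and per-step slew limit

def constrained_controls(z, u_init):
    """Map unconstrained variables z to controls satisfying Eq. (7).

    z has shape (horizon, n_controls); each step moves by at most DELTA via the
    tanh squashing, then is clipped into [U_MIN, U_MAX].
    """
    u_prev, out = u_init, []
    for z_t in z:
        u_t = np.clip(u_prev + DELTA * np.tanh(z_t), U_MIN, U_MAX)
        out.append(u_t)
        u_prev = u_t
    return np.array(out)
```

In the paper the z variables are optimized with Adam against the trajectory cost; since tanh is smooth, gradients flow through the whole rollout, while both the range and slew constraints hold by construction at every iterate.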
One complication in comparing the performance of different methods on a physical system is the inability to control environmental disturbances, which affect the achievable costs and steady-state behavior. In our DC cooling setting, the main disturbances are the EWT (temperature of entering cold water) and server power usage (a proxy for generated heat). These variables also reflect other factors (e.g., weather, time of day, server hardware, IT load). To facilitate a meaningful comparison, we evaluate the cost of control (i.e., cost of power and water used by the AHUs) for different ranges of states and disturbances.

5.1 System identification

We first evaluated our system identification strategy by comparing the following three models:

• Model 1: our model, trained on 3 hours of deliberate exploration data with controls following independent random walks limited to a safe range as described in Section 4.2.

• Model 2: trained on a week of historical data generated by local PID controllers. While this model is trained on 56 times more data than the others, it turns out that the data is not as informative. Since local controllers regulate LAT to a fixed offset above EWT, the model may simply learn this relationship rather than the dependence of LAT on controls. Furthermore, if state values do not vary much, it may learn to predict no changes in state.

• Model 3: trained on 3 hours of data with controls recommended by a certainty-equivalent controller (i.e., optimal controls w.r.t. all available data at each iteration, see Section 2), limited to a safe range. 
While this data contains a wider range of inputs than Model 2 data, it contains no exploratory actions.

Figure 4: Histograms of state variables and disturbances over time and AHUs during steady-state operation of MPC controllers using three different models.

Table 2: Average power and water cost (% data) for each controller, restricted to time points and AHUs for which CAT was within 0.25C of x_sp^CAT and pressure within 0.004 of x_sp^DP, stratified by values of the disturbances.

Entering water temperature (C) | Server load (fraction of max) | Model 1 cost (% data) | Model 2 cost (% data) | Model 3 cost (% data)
≤ 20.5 | ≤ 0.7 | 84.3 (31%) | 94.4 (29.9%) | 99.6 (13.7%)
> 20.5 | ≤ 0.7 | 85.8 (17.6%) | 93.8 (14.1%) | 112.7 (36.0%)
≤ 20.5 | > 0.7 | 142.4 (21.9%) | 149.4 (20.4%) | 178.2 (8.3%)
> 20.5 | > 0.7 | 144.6 (15.3%) | 148.9 (12.8%) | 182.1 (29.9%)
any | any | 110.2 (85.8%) | 117.9 (77.2%) | 140.4 (87.9%)

We controlled median CAT and DP at setpoints x_sp^CAT and x_sp^DP, using each model for approximately one day. We examine the steady-state behavior of the controllers next. Figure 4 shows histograms of states and disturbances during the operation of the three controllers, with data aggregated over both time and sensors. In all three cases, state variables remain close to their targets most of the time, but the controller based on Model 2 (historical data) had the highest steady-state error (e.g., the difference between CAT/DP and their setpoints is close to zero less often with Model 2). The distribution of server loads during the three tests was similar, while the EWT was somewhat higher for Model 3. The average cost of controls (fan power and water usage in the objective) was 115.9, 116.6, and 139.9, respectively; however, these are not directly comparable due to differences in steady-state error and disturbances.
Stratifying data by state and disturbance values is somewhat complicated. 
For example, sensor measurements at any location are affected by multiple AHUs with different EWTs. Similarly, each AHU affects measurements at multiple racks with different loads. To simplify analysis, we treat each group of sensors as dependent on its closest AHU, allowing independent consideration of each AHU. A lesser complication is the time lag between control changes and state changes. Since controllers largely operate in steady state, controls do not change often, so we consider time points independently.
To compare costs, we first restrict available data to time points and AHUs where temperatures were within 0.25C of x_sp^CAT and pressures within 0.004 of x_sp^DP (i.e., the intersection of histogram peaks in Figure 4, left). This corresponded to 85.8%, 77.2%, and 87.9% of the data for controllers using Models 1, 2, and 3, respectively. We then stratified the data by different ranges of EWT and server load. We evaluated the control cost for each disturbance range. The results are shown in Table 2, and suggest that the controller based on Model 1 (with explicit exploration data) is the most efficient.

5.2 Comparison to local PID controllers

The existing local PID controllers differ from ours in that they regulate LAT to a constant offset relative to EWT, rather than controlling CAT directly. To compare the two approaches, we ran our MPC controller with the same LAT-offset setpoints for one day, and compared it to a week of local PID control. As before, we treat measurements at each group of sensors as depending only on the closest AHU, and ignore time lags (assuming reasonable control consistency during steady-state operation). 
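The stratified cost evaluation used for Tables 2 and 3, restricting to near-setpoint samples and then averaging control cost within each (EWT, server load) stratum, can be sketched as follows. All arrays are synthetic stand-ins for the logged data; only the 0.25C and 0.004 thresholds come from the text.

```python
import numpy as np

def stratified_costs(cat_err, dp_err, ewt, load, cost):
    """Average cost per (EWT, load) stratum over near-setpoint samples.

    cat_err and dp_err are deviations from the CAT and DP setpoints; samples
    outside the 0.25C / 0.004 bands are discarded before stratification.
    """
    keep = (np.abs(cat_err) <= 0.25) & (np.abs(dp_err) <= 0.004)
    results = {}
    for ewt_hi in (False, True):        # EWT above / below 20.5 C
        for load_hi in (False, True):   # load above / below 0.7 of max
            m = keep & ((ewt > 20.5) == ewt_hi) & ((load > 0.7) == load_hi)
            if m.any():
                results[(ewt_hi, load_hi)] = cost[m].mean()
    return results
```

The fraction of samples surviving the `keep` mask corresponds to the "% data" entries reported alongside each cost in the tables.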
Histograms of states and disturbances during the operation of the two controllers are shown in Figure 5. Local controllers track the temperature setpoint more closely, but operate at higher DP. Server loads are similar, while average EWT is lower during local-controller operation.

Figure 5: Histograms of state variables and disturbances over time and AHUs during steady-state operation of the MPC (Model 1) and local PID controllers.

Table 3: Average total cost (% data) of each controller, restricted to time points and fancoils for which LAT-EWT and DP were within 0.25C and 0.004wg of their respective setpoints, stratified by values of the disturbances.

Entering water temperature (C) | Server load (frac. max) | Local controllers cost (% data) | Model 1 cost (% data)
≤ 20.5 | ≤ 0.7 | 95.3 (19.8%) | 106.4 (22.6%)
> 20.5 | ≤ 0.7 | 107.9 (13.8%) | 104.9 (15.0%)
≤ 20.5 | > 0.7 | 170.3 (20.1%) | 130.5 (18.8%)
> 20.5 | > 0.7 | 187.8 (20.4%) | 128.7 (18.0%)
any | any | 142.2 (74.4%) | 116.7 (74.1%)

To compare costs, we restrict data to AHUs and times corresponding to the peaks of histograms in Figure 5, left (about 74% of the data for both controllers). We stratify this data as above and compare the total cost in each stratum in Table 3. While local control was more cost efficient under low EWT and server load, our controller was more efficient under all other conditions and overall.
While the quadratic objective is a reasonable approximation, it does not correspond exactly to the true dollar cost of control, which is not quadratic and may change over time. 
After restricting to\ntemperatures and pressures as in Tables 3 and 2, the average dollar cost (units unspeci\ufb01ed) of our\nLAT and CAT controllers was 94% and 90.7% of the cost of the local controllers. While precise\nquanti\ufb01cation of these savings requires longer-term experiments, our approach of jointly optimizing\ncontrols of all AHUs, together with the ability to control process variables at slightly higher values,\nhas the potential to save about 9% of the current server-\ufb02oor cooling costs.\n\n6 Discussion\n\nWe have presented an application of model-based reinforcement learning to the task of regulating\ndata center cooling. Speci\ufb01cally, we have demonstrated that a simple linear model identi\ufb01ed from\nonly a few hours of exploration suf\ufb01ces for effective regulation of temperatures and air\ufb02ow on a\nlarge-scale server \ufb02oor. We have also shown that this approach is more cost effective than commonly\nused local controllers and controllers based on non-exploratory data.\nOne interesting question is whether the controller performance could further be improved by using\na higher-capacity model such as a neural network. However, such a model would likely require\nmore than a few hours of exploratory data to identify, and may be more complicated to plan with.\nPerhaps the most promising direction for model improvement is to learn a mixture of linear models\nthat could approximate dynamics better under different disturbance conditions. 
In terms of overall data center operational efficiency, further advantages are almost certainly achievable by jointly controlling the AHUs and the range of disturbance variables if possible, or by planning AHU control according to known/predicted disturbance values rather than treating them as noise.

Acknowledgments

The experiments performed for this paper would not have been possible without the help of many people. We would especially like to thank Dave Barker, Charisis Brouzioutis, Branden Davis, Orion Fox, Daniel Fuenffinger, Amanda Gunckle, Brian Kim, Eddie Pettis, Craig Porter, Dustin Reishus, Frank Rusch, Andy Thompson, and Abbi Ward. We also thank Gal Elidan for many valuable discussions.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Computational Learning Theory (COLT), 2011.

[3] Yasin Abbasi-Yadkori and Csaba Szepesvári. 
Bayesian optimal control of smoothly parameterized systems. In Uncertainty in Artificial Intelligence (UAI), pages 1–11, 2015.

[4] Marc Abeille and Alessandro Lazaric. Thompson sampling for linear-quadratic control problems. In AISTATS, 2017.

[5] K. J. Åström. Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Appl., 10:174–205, 1965.

[6] Karl Johan Åström and Björn Wittenmark. On self tuning regulators. Automatica, 9(2):185–199, 1973.

[7] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-scale Machines, 2nd Edition. Morgan & Claypool Publishers, 2013. Synthesis Lectures on Computer Architecture 8:3.

[8] Sergio Bittanti, Marco C. Campi, et al. Adaptive control of linear time invariant systems: the "bet on the best" principle. Communications in Information & Systems, 6(4):299–320, 2006.

[9] George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.

[10] Jie Chen and Guoxiang Gu. Control-oriented system identification: an H∞ approach, volume 19. Wiley-Interscience, 2000.

[11] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

[12] Jingjuan Dove Feng, Frank Chuang, Francesco Borrelli, and Fred Bauman. Model predictive control of radiant slab systems with evaporative cooling sources. Energy and Buildings, 87:199–210, 2015.

[13] Jim Gao. Machine learning applications for data center optimization. Google White Paper, 2014.

[14] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.

[15] Monson H. Hayes.
Statistical Digital Signal Processing and Modeling. Wiley, 1996.

[16] Arthur J. Helmicki, Clas A. Jacobson, and Carl N. Nett. Control oriented system identification: a worst-case/deterministic approach in H∞. IEEE Transactions on Automatic Control, 36(10):1163–1176, 1991.

[17] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems 25, pages 2636–2644. Curran Associates, Inc., 2012.

[18] A. Kelman and F. Borrelli. Bilinear model predictive control of a HVAC system using sequential quadratic programming. In Proceedings of the 2011 IFAC World Congress, 2011.

[19] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[20] Lennart Ljung, editor. System Identification: Theory for the User (2nd Edition). Prentice Hall, Upper Saddle River, New Jersey, 1999.

[21] Lennart Ljung and Torsten Söderström. Theory and practice of recursive identification, volume 5. JSTOR, 1983.

[22] Yudong Ma, Francesco Borrelli, Brandon Hencey, Brian Coffey, Sorin Bengea, and Philip Haves. Model predictive control for the operation of building cooling systems. IEEE Transactions on Control Systems Technology, 20(3):796–803, 2012.

[23] Yudong Ma, Anthony Kelman, Allan Daly, and Francesco Borrelli. Predictive control for energy efficient buildings with thermal storage: Modeling, stimulation, and experiments. IEEE Control Systems, 32(1):44–64, 2012.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[25] Yi Ouyang, Mukul Gagrani, and Rahul Jain.
Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017.

[26] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, 1988.

[27] Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. CoRR, abs/1710.06542, 2017.

[28] Arman Shehabi, Sarah Josephine Smith, Dale A. Sartor, Richard E. Brown, Magnus Herrlin, Jonathan G. Koomey, Eric R. Masanet, Nathaniel Horner, Inês Lima Azevedo, and William Lintner. United states data center energy usage report. Technical report, Lawrence Berkeley National Laboratory, 2016.

[29] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[30] Rongliang Zhou, Cullen Bash, Zhikui Wang, Alan McReynolds, Thomas Christian, and Tahir Cader. Data center cooling efficiency improvement through localized and optimized cooling resources delivery. In ASME 2012 International Mechanical Engineering Congress and Exposition, pages 1789–1796. American Society of Mechanical Engineers, 2012.