The MDPDist Mathematica package

On-the-fly exact computation of bisimilarity distances for Markov Decision Processes with rewards

Authors: Giorgio Bacci, Giovanni Bacci, Kim G. Larsen, and Radu Mardare

In this tutorial we present the main features of the MDPDist package. This Mathematica package implements an on-the-fly method for computing the bisimilarity pseudometric of Ferns et al. [1] for Markov Decision Processes with rewards. The algorithm implemented in this package is an extension of the on-the-fly method recently proposed in [2] for Markov Chains.

Preliminaries

Markov Decision Processes and Bisimulation

A probability distribution over a nonempty enumerable set S is a function μ : S → [0,1] such that Σ_{s ∈ S} μ(s) = 1. We denote by Δ(S) the set of probability distributions over S.

Definition (Markov Decision Process). A Markov Decision Process with rewards (MDP) is a tuple M = (S, A, τ, ρ) consisting of a finite nonempty set S of states, a finite nonempty set A of actions, a transition function τ : S × A → Δ(S), and a reward function ρ : S × A → ℝ.

The operational behavior of an MDP M = (S, A, τ, ρ) is as follows: the process in the state s_i nondeterministically chooses an action a ∈ A and changes its state to s_j with probability τ(s_i, a)(s_j). The choice of the action a made in s_i is rewarded by ρ(s_i, a). The executions are transition sequences w = (s_0, a_0), (s_1, a_1), ...; the challenge is to find strategies for choosing the actions in order to maximize the accumulated reward Σ_{i ≥ 0} λ^i ρ(s_i, a_i), where λ ∈ (0,1) is the discount factor. A strategy is given by a function π : S → Δ(A), called a policy, where π(s)(a) is the probability of choosing the action a in the state s; it induces a probability distribution over executions defined, for arbitrary w = (s_0, a_0), (s_1, a_1), ..., by P_π(w) = Π_{i ≥ 0} π(s_i)(a_i) · τ(s_i, a_i)(s_{i+1}). The value of s according to π is characterized by V_π(s) = E_{w ∈ Ω(s)}[Σ_{i ≥ 0} λ^i ρ(s_i, a_i)], where Ω(s) is the set of executions from s and the expectation is taken with respect to P_π.

Definition (Stochastic Bisimulation). Let M = (S, A, τ, ρ) be an MDP. An equivalence relation R ⊆ S × S is a stochastic bisimulation if whenever (s, t) ∈ R then, for all a ∈ A, ρ(s, a) = ρ(t, a) and, for all R-equivalence classes C, τ(s, a)(C) = τ(t, a)(C), where τ(s, a)(C) = Σ_{s' ∈ C} τ(s, a)(s'). Two states s, t ∈ S are stochastically bisimilar, written s ∼ t, if they are related by some stochastic bisimulation on M.

Bisimilarity Pseudometric

To cope with the problem of measuring how similar two states of an MDP are, Ferns et al. [1] defined a bisimilarity pseudometric that measures the behavioral similarity of non-bisimilar states. It is defined as the least fixed point of a transformation operator on functions in S × S → [0,∞).
Consider an arbitrary MDP M = (S, A, τ, ρ) and a discount factor λ ∈ (0,1). The set S × S → [0,∞) of [0,∞)-valued maps on S × S is equipped with the point-wise partial order defined by d ⊑ d′ ⇔ d(s,t) ≤ d′(s,t), for all s,t ∈ S.
We define a fixed point operator F_λ, for d : S × S → [0,∞) and s,t ∈ S, as

F_λ(d)(s, t) = max_{a ∈ A} ( |ρ(s, a) − ρ(t, a)| + λ · T_d(τ(s, a), τ(t, a)) )

where T_d(τ(s, a), τ(t, a)) denotes the optimal value of a transportation problem with marginals τ(s, a) and τ(t, a) and costs given by d:

T_d(μ, ν) = min_ω Σ_{u,v ∈ S} d(u, v) · ω(u, v)
subject to  Σ_{v ∈ S} ω(u, v) = μ(u) for all u ∈ S,
            Σ_{u ∈ S} ω(u, v) = ν(v) for all v ∈ S,
            ω(u, v) ≥ 0 for all u, v ∈ S.

The operator F_λ is monotonic; thus, by Tarski's fixed point theorem, it admits a least fixed point. This fixed point is the bisimilarity pseudometric.

Definition (Bisimilarity pseudometric). Let M = (S, A, τ, ρ) be an MDP and λ ∈ (0,1) be a discount factor; then the λ-discounted bisimilarity pseudometric for M, written δ_λ, is the least fixed point of F_λ.

The pseudometric δ_λ enjoys the property that two states are at distance zero if and only if they are bisimilar.

Getting started

The package can be loaded by setting the current directory to the directory of the MDPDist package and then using the command << (Get) as follows

MDP Tutorial_29.gif
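For instance, a minimal loading sequence looks like the following sketch (assuming the notebook is saved in the package directory and the package file is named MDPDist.m):

SetDirectory[NotebookDirectory[]];   (* make the package directory the current directory *)
<< MDPDist`                          (* load the package with Get *)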

Encoding and manipulating Markov Decision Processes

Encoding an MDP

An MDP M = (S, A, τ, ρ) with S = {s_1, ..., s_n} and A = {a_1, ..., a_m} is represented as a term of the form MDP[tm, rw, act] where
- tm is an n × m × n probability matrix such that tm[[i,k,j]] = τ(s_i, a_k)(s_j) for all 1 ≤ i,j ≤ n and all 1 ≤ k ≤ m,
- rw is an n × m real-valued matrix such that rw[[i,k]] = ρ(s_i, a_k) for all 1 ≤ i ≤ n and all 1 ≤ k ≤ m, and
- act is a list of pairwise distinct strings such that act[[k]] represents the action a_k for all 1 ≤ k ≤ m.
A state s_i is implicitly represented by its index 1 ≤ i ≤ n.

Assume we want to encode the following MDP M = (S, A, τ, ρ) with S = {1,2,3,4,5,6} and A = {inc, dec}, with the following intuitive operational behavior. Given M in state i, the choice of the action inc causes the MDP to move to state i+1 with probability 0.8, and to stay in state i or to move to state i−1 each with probability 0.1. The choice of the action dec causes the MDP to move to state i−1 with probability 0.7, to stay in state i with probability 0.1, and to move to state i+1 with probability 0.2. The reward of choosing an inc action is always −1, whereas the reward for a dec action is 5 if it is made from state 5 or 6, and 0 otherwise.

Mathematica makes it easy to define functions and lists, so we can exploit these features to define the data structures needed to encode M

MDP Tutorial_36.gif
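The actual definitions are shown in the input above; the following is a sketch reconstructing them from the description of M (the names A, τ, ρ and n follow the text, while the boundary behavior at states 1 and 6 is an assumption, since it is not specified):

A = {"inc", "dec"};
n = 6;   (* number of states *)
(* τ[i, a] returns the distribution over successor states as a length-n vector;
   moves that would fall outside 1..n are redirected to the boundary state (an assumption) *)
τ[i_, a_] := Module[{p, v = ConstantArray[0., n]},
  p = If[a == "inc", {0.1, 0.1, 0.8}, {0.7, 0.1, 0.2}];   (* probability of moving to i-1, staying in i, moving to i+1 *)
  MapThread[(v[[Clip[#1, {1, n}]]] += #2) &, {{i - 1, i, i + 1}, p}];
  v];
(* reward: -1 for every inc; 5 for a dec made from state 5 or 6, 0 otherwise *)
ρ[i_, a_] := If[a == "inc", -1, If[i >= 5, 5, 0]];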

A represents the set of actions, τ is the transition function and ρ is the reward function.
One can obtain the respective transition and reward matrices using the built-in function Table as follows:

MDP Tutorial_37.gif

MDP Tutorial_38.gif

MDP Tutorial_39.gif

Then M is encoded by

MDP Tutorial_40.gif
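Under the sketched definitions of τ and ρ above, the matrices and the term can be built as follows (the names tm, rw and mdp follow the text; the exact notebook input may differ):

tm = Table[τ[i, a], {i, n}, {a, A}];   (* n × m × n transition matrix: tm[[i,k,j]] = τ(s_i, a_k)(s_j) *)
rw = Table[ρ[i, a], {i, n}, {a, A}];   (* n × m reward matrix: rw[[i,k]] = ρ(s_i, a_k) *)
mdp = MDP[tm, rw, A]                   (* the term encoding M *)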

Alternatively, one may use the functions MDPtm and MDPrw to construct the transition and reward matrices by listing only the values different from zero. Given the list of rules tmRules = { MDP Tutorial_41.gif, the transition probability matrix induced by the transition function τ may be constructed by calling MDPtm[tmRules, n, m].
Analogously, given rwRules = { MDP Tutorial_42.gif, one can construct the reward matrix induced by the reward function ρ by calling MDPrw[rwRules, n, m].

The list of rules may be given explicitly or using patterns. For example, the matrices tm and rw defined above can be constructed as follows

MDP Tutorial_43.gif
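The rule syntax expected by MDPtm and MDPrw is the one shown in the input above; purely as an illustration, assuming hypothetical rules of the form {state, action index, state} -> probability and {state, action index} -> reward, the calls look like:

(* hypothetical rule format: the actual syntax is the one used in the notebook input above *)
tmRules = {{1, 1, 2} -> 0.8, {1, 1, 1} -> 0.2};   (* a few non-zero entries of τ, illustrative only *)
rwRules = {{5, 2} -> 5, {6, 2} -> 5};             (* non-zero rewards for dec in states 5 and 6 *)
MDPtm[tmRules, 6, 2]    (* 6 × 2 × 6 transition matrix with the listed non-zero entries *)
MDPrw[rwRules, 6, 2]    (* 6 × 2 reward matrix with the listed non-zero entries *)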

Manipulating MDPs

An MDP is displayed by the function PlotMDP. States are represented as white nodes labeled with their respective index. Red edges exiting from each node represent a nondeterministic choice of an action. The chosen action is displayed as a colored node, called a choice node, labeled with its associated reward. Edges exiting from a choice node represent the probability distribution obtained after the nondeterministic choice. For example, the MDP mdp defined above is displayed as follows

MDP Tutorial_44.gif

MDP Tutorial_45.gif
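A minimal usage sketch (mdp is the term built above):

PlotMDP[mdp]   (* draws states, choice nodes, and probability edges as described above *)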

Given a sequence M_1, …, M_k of MDPs, JoinMDP[M_1, …, M_k] yields an MDP representing their disjoint union, where the indices representing the set of states are obtained by shifting the indices of the states of its arguments according to their relative order in the sequence (e.g., if M_1 has n states, the index corresponding to the i-th state of M_2 in JoinMDP[M_1, M_2] is n + i).

MDP Tutorial_52.gif

MDP Tutorial_53.gif
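For example, the disjoint union of two copies of mdp can be built and displayed as in the following sketch (joined is an assumed name):

joined = JoinMDP[mdp, mdp];   (* states 1-6 come from the first copy, 7-12 from the second *)
PlotMDP[joined]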

Computing Bisimilarity Distances

Given an MDP M, a discount factor λ, and a list of state pairs Q, BDistMDP[M,λ,Q] computes the distance δ_λ(s,t) associated with each pair of states (s,t) in Q. For example, taking the MDP mdp considered so far, a discount factor 0.5 and a single-pair query Q = {(1,2)}, we have

MDP Tutorial_55.gif

MDP Tutorial_56.gif

while for the query Q = {(1,2),(3,4),(6,5)} we have

MDP Tutorial_57.gif

MDP Tutorial_58.gif

If one is interested in computing the distance between all state pairs, one can use the alias All as follows

MDP Tutorial_59.gif

MDP Tutorial_60.gif
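For reference, the three query forms used above correspond to the following calls (a sketch; state pairs are written as two-element lists):

BDistMDP[mdp, 0.5, {{1, 2}}]                    (* single-pair query Q = {(1,2)} *)
BDistMDP[mdp, 0.5, {{1, 2}, {3, 4}, {6, 5}}]    (* query Q = {(1,2),(3,4),(6,5)} *)
BDistMDP[mdp, 0.5, All]                         (* all state pairs *)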

The computation can be traced by setting the option Verbose to True. It will display the couplings encountered during each iteration of the on-the-fly algorithm. This option can also be used to analyze how the algorithm explores the state space of the MDP.

MDP Tutorial_61.gif
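A sketch of such a traced call:

BDistMDP[mdp, 0.5, {{1, 2}}, Verbose -> True]   (* prints the couplings explored at each iteration *)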

One can provide some estimates of the distances in order to further reduce the exploration of the state space. This can be done using the option Estimates.

MDP Tutorial_62.gif

MDP Tutorial_63.gif
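Purely as an illustration (the exact format of the supplied estimates is the one shown in the input above; pair -> upper-bound rules are an assumption):

BDistMDP[mdp, 0.5, {{1, 2}}, Estimates -> {{3, 4} -> 0.4, {5, 6} -> 0.2}]   (* hypothetical estimates format *)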

Exact Computations

Mathematica allows one to perform exact computations over rational numbers. Usually, the Bellman-optimality equation system is solved approximately, since it is in general a non-linear equation system. Despite this, our method is able to compute the exact result when the input MDP is rational-valued (i.e., both the transition function and the reward function are rational-valued).

For example, the MDP we considered so far can be rationalized as follows

MDP Tutorial_64.gif

Now the exact value of the distance can be computed by setting the option Exact to True

MDP Tutorial_65.gif

MDP Tutorial_66.gif
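A sketch of an exact run, assuming the rational-valued MDP is obtained by applying the built-in Rationalize to the matrices:

mdpEx = MDP[Rationalize[tm, 0], Rationalize[rw, 0], A];   (* rational-valued transition and reward matrices *)
BDistMDP[mdpEx, 1/2, {{1, 2}}, Exact -> True]             (* the distance is returned as an exact rational number *)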

(Asynchronous) Parallel Composition

Usually, MDPs are specified as the composition of several smaller sub-systems. To this end, several composition operators have been introduced; among them we recall the synchronous (and asynchronous) parallel composition and the probabilistic choice. These operators allow one to describe very large systems implicitly as the composition of more manageable sub-systems.

The BDistMDP function is able to directly handle MDPs defined as the (asynchronous) parallel composition of other systems. Given a sequence of MDPs M_1, …, M_k, their (asynchronous) parallel composition can be implicitly defined as a term CMDP[enc, {M_1, …, M_k}], where enc represents the “encoding maps” that associate each composite state with its encoded state.
It is worth noting that CMDP[enc, {M_1, …, M_k}] is treated as a term: CMDP is a simple data structure that does not compute the parallel composition of M_1, …, M_k.

Let us consider a simple yet meaningful set of experiments performed on a collection of MDPs, parametric in the probabilities, modeling a pipeline. The following function specifies an element MDP Tutorial_72.gif of a pipeline with actions MDP Tutorial_73.gif for i ∈ N. A pipeline is modeled as the parallel composition of different processing elements, which are connected in series by means of synchronization on shared actions.

MDP Tutorial_74.gif

MDP Tutorial_75.gif

MDP Tutorial_76.gif

Let us define some element instances

MDP Tutorial_77.gif

The encoding map associated with the composite system MDP Tutorial_78.gif has to be computed using the function getEncodingMaps as follows

MDP Tutorial_79.gif

Now the pipeline instance MDP Tutorial_80.gif is encoded by the following term

MDP Tutorial_81.gif
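Assuming the three element instances defined above are bound to e1, e2 and e3 (hypothetical names), the steps look like the following sketch (the argument form of getEncodingMaps is an assumption):

enc = getEncodingMaps[{e1, e2, e3}];   (* encoding maps from composite states to state indices *)
pipeline = CMDP[enc, {e1, e2, e3}]     (* the pipeline as an implicit composition: nothing is computed here *)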

The index of a state of the pipeline MDP Tutorial_82.gif can be specified as a tuple of the form MDP Tutorial_83.gif, where MDP Tutorial_84.gif is a state belonging to MDP Tutorial_85.gif. The encoding map defined above helps in retrieving the state index corresponding to a composite state. For example, the pair of state indices associated with the pair {{1,1,1},{1,2,1}} is given by

MDP Tutorial_86.gif

MDP Tutorial_87.gif

However, the function BDistMDP can directly handle such composite pairs of states. For example, assuming λ = 1/2, the distance between the composite states {1,1,1} and {1,2,1} in MDP Tutorial_88.gif is

MDP Tutorial_89.gif

MDP Tutorial_90.gif
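A sketch of this call (pipeline is the CMDP term from the previous sketch):

BDistMDP[pipeline, 1/2, {{{1, 1, 1}, {1, 2, 1}}}]   (* distance between the composite states {1,1,1} and {1,2,1} *)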

Bisimilarity Classes & Quotient

Since δ_λ(s,t) equals zero if and only if the states s and t are bisimilar, we can use the algorithm for computing δ_λ to compute the bisimilarity classes and the bisimilarity quotient of an MDP.

The function BisimClassesMDP[M] returns the set of bisimilarity classes of M, that is S/∼.

MDP Tutorial_93.gif

MDP Tutorial_94.gif

MDP Tutorial_95.gif

MDP Tutorial_96.gif
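A minimal usage sketch:

BisimClassesMDP[mdp]   (* returns the bisimilarity classes S/∼ of mdp *)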

The function BisimQuotientMDP[M] yields an MDP representing the bisimilarity quotient of M.

MDP Tutorial_97.gif

MDP Tutorial_98.gif
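A minimal usage sketch, displaying the quotient with PlotMDP:

quotient = BisimQuotientMDP[mdp];   (* MDP over the bisimilarity classes of mdp *)
PlotMDP[quotient]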
