General Approach: Decentralized, Anytime, Local-Repair Scheduling
The ANTS programme deals with coordinating large, distributed networks of resources to accomplish multi-resource, real-time tasks. More detail is provided in the background section; here is a quick summary:
- The coordination process must be scalable in the number of resources.
- It must also provide real-time responses as tasks are generated and as the load on the resources varies.
- Tasks are considered to be of high value, so as many tasks as possible should be accomplished, although true optimality may not be possible given the real-time constraints.
- The coordination process must be tolerant of partial resource and communication failure.
The approach that we take in our e-Merge-ANT project is based on decentralized, anytime, local-repair scheduling:
- Decentralized: scheduling agents1 maintain a distributed schedule
- Each agent has a schedule fragment that dictates the actions that a subset of the resources are to execute, and when they are to do so.
- Anytime: iterative improvement occurs simultaneously with schedule execution
- The scheduling agents continually strive to improve their schedule fragments, but at all times, they have schedules that their resources can execute.
- Local repair:
- As the resources execute the schedule fragments, the scheduling agents communicate with each other and each locally improves its schedule fragment based on the information it gains from the other scheduling agents.
Thus there are two simultaneous, ongoing processes, each of which is fully decentralized: the first process is the maintenance of suitable schedule fragments by the schedulers, and the second process is the execution of the schedule fragments by the resources.
Of course, there may be other ongoing processes: for example, in a network of sensors, a (decentralized) data fusion process will combine the data gathered by the distributed sensors to form some coherent model of the real world which will in turn influence the schedulers (they will adapt the schedules based upon the location of targets that are to be tracked, for example).
Decentralized, anytime, local-repair scheduling is well-suited to coordinating large, distributed resource networks:
- Time-based resource coordination
- The coordination of the resources is achieved by scheduling their actions: for example, if two sensors need to scan the same target simultaneously, their schedules inform them where to look and at what time. Time-based coordination does requires a common idea of time throughout the network, but this is generally required anyway (e.g., for data fusion) and there are well-known methods for maintaining a reasonably accurate common clock with low costs.
- Local interaction - scalability
- Each scheduling agent deals with only a limited number of resources and communicates with only a small number of other scheduling agents, so the per-resource computational, communication and storage costs are essentially constant. In other words, this approach is scalable.
- Real-time response but also high effectiveness
- The anytime iterative improvement scheme allows for a fast initial response to changes (e.g., a new task being generated or a resource failing). The initial response may not be optimal, but it is expected to improve as the schedulers communicate. What is important is that time spent coordinating the resources does not significantly deplete the time the resources have available for accomplishing tasks. In other words, the anytime approach delivers real-time response while also potentially achieving a high proportion of task accomplishment.
- Graceful adaptation
- Since the coordination process is ongoing, it can adapt to changing circumstances such as resource failure or load variation.
- Each scheduling agent needs to interact with only a limited number of other scheduling agents (those agents whose resources could affect its own resources). Thus, a local communication failure will affect only a limited number of scheduling agents. In other words, the coordination process is robust.
Scheduling is known to be a hard problem, so it might seem that this approach is vulnernable to phase transitions in which the computational complexity soars when the load on the network is close to critical. However, it seems that the iterative repair approach avoids these phase transitions by not attempting to find truly optimal solutions: each scheduler performs local improvements based on limited knowledge gained from a small number of other schedulers.
Of course, it is still entirely possible for a badly designed decentralized, anytime, local-repair scheduler to incur high costs and deliver poor performance. All that can be claimed from this generic discussion is that a promising framework is available. Detailed assessment of a specific scheduler can be performed in the context of the challenge problem.
Our presentation from the spring 2000 ANTS meeting has more detail on scheduling and anytime algorithms.