Job Execution

Introduction

Different job types are executed differently, and the protocol and the information exchanged between the daemons also depend on the job options. The following sections describe this information exchange using a few example jobs.

Job setup and start

When a job is started, the Bareos Director invokes RunJob(), which calls SetupJob() to initialize the job and then passes it to JobqAdd(). The job then sits in the job queue, waiting for a jobq_server to pick it up and run it.

@startuml
|RunJob|
start
|SetupJob|
:InitMsg();
#aqua:initialize term_wait condition]
#aqua:initialize nextrun_ready condition]
:CreateUniqueJobName()|
:set job status to JS_Created;
:get db connection from pool;
:InitJcrJobRecord()|
if (job has client) then (yes)
  :GetOrCreateClientRecord()|
endif
:CreateJobRecord()|
:set jcr->jobid from jcr->jr.jobid]
:NewPlugins()|
:DispatchNewPluginOptions()|
:GeneratePluginEvent(bDirEventJobStart)|
if (JobReads && !jcr->res.read_storage_list) then (yes)
  :CopyRwstorage()|
endif
partition "Type Specific Setup" {
  :lots of magic omitted here;
}
:GeneratePluginEvent(bDirEventJobInit)|
|RunJob|
if (success) then (yes)
  |JobqAdd|
  :increment jcr's use counter]
  if (scheduled in the future) then (yes)
    :spawn waiter thread;
  else (no)
    if (job canceled) then (yes)
      :put on top of ready_jobs;
    else (no)
      :priority-aware insert to waiting_jobs;
    endif
    :StartServer()|
    note
      this will spawn a new jobq_server
      thread each time it is called until
      the jobq's thread limit is reached.
    end note
  endif
  |RunJob|
  if (success) then (yes)
    #tomato:return jobid;
    note
      at this point the jcr is either
      somewhere in the job-queue or there
      is a thread running that will inject
      it at some point in the future.
    end note
    detach
  endif
endif
#tomato:return 0;
detach
@enduml
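
The queueing decision in the JobqAdd lane can be illustrated with a minimal sketch. It is not the Director's implementation: the Job and JobQueue classes are hypothetical, and it assumes that a lower priority number runs earlier and that jobs of equal priority keep their arrival order.

```python
import bisect
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    priority: int           # assumption: lower number means "runs earlier"
    canceled: bool = False


@dataclass
class JobQueue:
    waiting_jobs: List[Job] = field(default_factory=list)
    ready_jobs: List[Job] = field(default_factory=list)

    def add(self, job: Job) -> None:
        if job.canceled:
            # A canceled job is put on top of ready_jobs.
            self.ready_jobs.insert(0, job)
            return
        # Priority-aware insert: keep waiting_jobs sorted by priority and
        # place the new job behind existing jobs of the same priority.
        keys = [j.priority for j in self.waiting_jobs]
        self.waiting_jobs.insert(bisect.bisect_right(keys, job.priority), job)
```

The sketch leaves out the waiter thread for jobs scheduled in the future and the StartServer() call that spawns jobq_server threads.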

After the jobq_server picks up the job, a job_thread is started. That thread does some more setup work and then runs the type-specific job payload.

@startuml
start
:detach thread;
:set job status to JS_Running;

if (max start delay exceeded) then (yes)
  #tomato:cancel job;
  detach
endif
if (max run sched time exceeded) then (yes)
  #tomato:cancel job;
  detach
endif
:UpdateJobStartRecord()|
:RunScripts(BeforeJob)|
:UpdateJobStartRecord()|
note
  this happens twice so files created
  by a runscript are not picked up
  twice
end note
:GeneratePluginEvent(bDirEventJobRun)|

partition "Job-Type specific run" {
  :this is where the actual job run happens;
}

:warn if subscriptions exceeded;

:RunScripts(AfterJob)|
:DequeueMessages()|
:GeneratePluginEvent(bDirEventJobEnd)|
end
@enduml
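
The two early cancellation checks at the top of the job thread can be sketched as follows. This is an illustration under assumptions, not the Director's code: the function name is hypothetical, both limits are assumed to be measured in seconds from the scheduled time, and 0 is assumed to mean "no limit".

```python
from typing import Optional


def cancel_reason(now: float, sched_time: float,
                  max_start_delay: float = 0,
                  max_run_sched_time: float = 0) -> Optional[str]:
    """Return why the job should be canceled before it runs, or None."""
    waited = now - sched_time  # time spent since the job was scheduled
    if max_start_delay > 0 and waited > max_start_delay:
        return "max start delay exceeded"
    if max_run_sched_time > 0 and waited > max_run_sched_time:
        return "max run sched time exceeded"
    return None  # no limit hit: continue with UpdateJobStartRecord() etc.
```

The checks run in the same order as in the diagram, so a job that violates both limits is reported with the start-delay reason.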

Simple backup job

Because many configuration options change the job execution in subtle ways, this example assumes the following:

  • the Bareos File Daemon is not an active client, so the Bareos Director initiates the connection to the Bareos File Daemon
  • the Bareos File Daemon is not a passive client, so the Bareos File Daemon initiates the connection to the Bareos Storage Daemon

When such a job is run, the Bareos Director connects to the Bareos Storage Daemon and does the initial job setup. The Bareos Director then connects to the Bareos File Daemon to set up and start the job there. Finally, the Bareos File Daemon connects to the Bareos Storage Daemon and sends its data there.
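
The effect of the two assumptions on who opens which connection can be written down as a small sketch. The function name and the tuple layout are hypothetical. The default case follows the text above. The reversed directions for active and passive clients are an assumption based on how those client modes are generally described, and this section does not cover them.

```python
from typing import List, Tuple


def connection_plan(active_client: bool = False,
                    passive_client: bool = False) -> List[Tuple[str, str]]:
    """Return (initiator, target) pairs in the order they are opened."""
    plan = [("Director", "Storage Daemon")]
    if active_client:
        # Assumption: an active client connects to the Director itself.
        plan.append(("File Daemon", "Director"))
    else:
        plan.append(("Director", "File Daemon"))
    if passive_client:
        # Assumption: a passive client waits for the Storage Daemon.
        plan.append(("Storage Daemon", "File Daemon"))
    else:
        plan.append(("File Daemon", "Storage Daemon"))
    return plan
```

Calling connection_plan() with its defaults yields the three connections of the simple backup job described here.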

Overview simple backup job

@startuml
participant d as "Director"
participant s as "Storage Daemon"
participant f as "File Daemon"
d -> s : authenticate
d -> s : send plugin options
alt if rescheduling
  d -> s : cancel previous job
end
d -> s : setup job
d -> s : reserve device for append
d -> s : start job
== Message thread for SD communication spawned ==
s -> d : job status: waiting for filedaemon
d -> f : authenticate
d -> f : setup job
d -> f : tell fd to connect to sd
f -> s : authenticate
f -> d : tell dir that connection to sd is ready
s -> d : tell dir that fd has connected, job status: running
d -> f : send runscripts for client
d -> f : execute run before scripts
alt if accurate
d -> f : send accurate file list
end
d -> f : start backup
f -> s : open session
s -> d : job status: running
f -> s : data records
f -> s : BNET_EOD
f -> s : close session
s -> f : BNET_EOD
f -> s : BNET_TERMINATE
f -> d : tell dir that job has finished
f -> d : dequeue messages
d -> f : BNET_TERMINATE
s -> d : dequeue messages
s -> d : tell dir that job has finished
== Message thread for SD communication exits ==
s -> d : BNET_EOD
@enduml

Detailed View simple backup job

@startuml
participant d as "Director"
participant s as "Storage Daemon"
participant f as "File Daemon"
group Initiate dir to sd connection
  d -> s : Hello
  s -> d : CRAM-MD5 Challenge
  d -> s : CRAM-MD5 Response
  alt success
    s -> d : 1000 OK auth
  else failure
    s -> d : 1999 Authorization failed.
  end
  d -> s : CRAM-MD5 Challenge
  s -> d : CRAM-MD5 Response
  alt success
    d -> s : 1000 OK auth
  else failure
    d -> s : 1999 Authorization failed.
  end
end
loop each option
  d -> s : pluginoptions %s
  s -> d : 2000 OK plugin options
end
alt if rescheduling
  d -> s : cancel Job=%s
  s -> d : 3000 JobId=%ld Job="%s" marked to be %s.
end
d -> s : JobId=%s [...]
s -> d : 3000 OK Job SDid=%d SDtime=%d Authorization=%100s
d -> s : getSecureEraseCmd
s -> d : 2000 OK SDSecureEraseCmd %s
loop each write_storage
  d -> s : use storage=%s media_type=%s pool_name=%s pool_type=%s append=1 copy=%d stripe=%d
  loop each device
    d -> s : use device=%s
  end
  d -> s : BNET_EOD
end
d -> s : BNET_EOD
s -> d : 3000 OK use device device=%s
d -> s : run
== Message thread for SD communication spawned ==
s -> d : Status Job=%s JobStatus=70
note right
  70 is numeric for 'F' which
  means waiting for filedaemon
end note
group Initiate dir to fd connection
  d -> f : Hello Director %s calling
  f -> d : CRAM-MD5 Challenge
  d -> f : CRAM-MD5 Response
  alt success
    f -> d : 1000 OK auth
  else failure
    f -> d : 1999 Authorization failed.
  end
  d -> f : CRAM-MD5 Challenge
  f -> d : CRAM-MD5 Response
  alt success
    d -> f : 1000 OK auth
  else failure
    d -> f : 1999 Authorization failed.
  end

end
' == SendJobInfoToFileDaemon() ==
d -> f : JobId=%s Job=%s SDid=%u SDtime=%u Authorization=%s [ssl=%d]
f -> d : 2000 OK Job %s (%s) %s,%s,%s,%s,%s
' == SendLevelCommand() ==
d -> f : level = [accurate_]<base|full|differential|incremental>[ rerunning ] mtimeonly=0
alt if differential or incremental
  d -> f : level = since_utime=<stime> mtimeonly=0 prev_job=<PrevJob>
end
f -> d : 2000 OK level
' == SendIncludeList() ==
group send fileset to fd
  d -> f : fileset[ vss=1]
  loop each include item, then each exclude item
    d -> f : I/E
    note right
      I for include, E for exclude
    end note
    loop each ignoredir
      d -> f : Z <ignoredir>
    end
    loop each option-block
      d -> f : O <options>
      loop each regex
        d -> f : R <regex>
      end
      loop each regexdir
        d -> f : RD <regex>
      end
      loop each regexfile
        d -> f : RF <regex>
      end
      loop each wild
        d -> f : W <wild>
      end
      loop each wilddir
        d -> f : WD <wild>
      end
      loop each wildfile
        d -> f : WF <wild>
      end
      loop each wildbase
        d -> f : WB <wild>
      end
      loop each base
        d -> f : B <base>
      end
      loop each fstype
        d -> f : X <fstype>
      end
      loop each drivetype
        d -> f : XD <drivetype>
      end
      alt if plugin
        d -> f : G <plugin>
      end
      alt if reader
        d -> f : D <reader>
      end
      alt if writer
        d -> f : T <writer>
      end
      d -> f : N
    end
    loop name_list
      d -> f : F <item>
    end
    d -> f : N
    loop plugin_list
      d -> f : F <plugin>
    end
    d -> f : N
  end
  d -> f : BNET_EOD
  f -> d : 2000 OK include
end
'== SendExcludeList() ==
' this function does nothing
'== SendPreviousRestoreObjects() ==
alt if (incr or diff) and restore objects exist
  loop each restore object
    d -> f : restoreobject JobId=%s %s,%s,[...]
    d -> f : <object-name>
    d -> f : <object-value>
  end
  d -> f : restoreobject end
  f -> d : 2000 OK ObjectRestored
end
' == SendSecureEraseReqToFd() ==
d -> f : getSecureEraseCmd
f -> d : 2000 OK FDSecureEraseCmd <erase-cmd>
'== SendBwLimitToFd() ==
d -> f : setbandwidth=%d Job=%s
f -> d : 2000 OK Bandwidth

d -> f : storage address=%s port=%d ssl=%d
group Initiate fd to sd connection
  f -> s : Hello Start Job %s
  s -> f : CRAM-MD5 Challenge
  f -> s : CRAM-MD5 Response
  alt success
    s -> f : 1000 OK auth
  else failure
    s -> f : 1999 Authorization failed.
  end
  f -> s : CRAM-MD5 Challenge
  s -> f : CRAM-MD5 Response
  alt success
    f -> s : 1000 OK auth
  else failure
    f -> s : 1999 Authorization failed.
  end
end
f -> d : 2000 OK storage
s -> d : 3010 Job %s start
s -> d : Status Job=%s JobStatus=82
note right
  82 is numeric for 'R' which
  means running
end note
'== SendRunscriptsCommands() ==
alt if runscripts for client
  loop each runscript for this level
    d -> f : Run OnSuccess=%u OnFailure=%u AbortOnError=%u When=%u Command=%s
    f -> d : 2000 OK RunScript
  end
  alt if before script
    d -> f : RunBeforeNow
    f -> d : 2000 OK RunBeforeNow
  end
end
'== SendAccurateCurrentFiles() ==
alt if accurate enabled
  d -> f : accurate files=<approx-number-of-files>
  loop each accurate file
    d -> f : /path/to/file\0LStat\0MD5\0DeltaSeq
  end
  d -> f : BNET_EOD
end
d -> f : backup FileIndex=%ld
f -> d : 2000 OK backup
f -> s : append open session
s -> f : 3000 OK open ticket = <ticket-no>
f -> s : append data <ticket-no>
s -> d : Status Job=%s JobStatus=82
note right
  82 is numeric for 'R' which
  means running
end note
s -> f : 3000 OK data
f -> s : Data Records
f -> s : BNET_EOD
s -> f : 3000 OK append data
f -> s : append end session <ticket-no>
s -> f : 3000 OK end
f -> s : append close session <ticket-no>
s -> f : 3000 OK close Status = %d
s -> f : BNET_EOD
f -> s : BNET_TERMINATE
f -> d : 2800 End Job TermCode=%d JobFiles=%u ReadBytes=%s JobBytes=%s Errors=%u VSS=%d Encrypt=%d
loop each queued message
  f -> d : Jmsg Job=%s type=%d level=%lld %s
end
d -> f : BNET_TERMINATE

loop each queued message
  s -> d : Jmsg Job=%s type=%d level=%lld %s
end
s -> d : Status Job=%s JobStatus=84
note right
  84 is numeric for 'T' which
  means terminated normally
end note
s -> d : 3099 Job %s end JobStatus=%d JobFiles=%d JobBytes=%s JobErrors=%u
== Message thread for SD communication exits ==
s -> d : BNET_EOD
@enduml
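
The numeric JobStatus values in the diagram are the character codes of the status letters (70 is 'F', 82 is 'R', 84 is 'T'). The following sketch decodes a status line of the form shown above. The function name and the lookup table are hypothetical helpers, and the table only contains the letters that appear in this diagram.

```python
import re
from typing import Tuple

# Only the status letters used in the simple backup diagram.
STATUS_MEANING = {
    "F": "waiting for filedaemon",
    "R": "running",
    "T": "terminated normally",
}

STATUS_RE = re.compile(r"^Status Job=(\S+) JobStatus=(\d+)$")


def parse_status(line: str) -> Tuple[str, str, str]:
    """Split a status line into job name, status letter and meaning."""
    match = STATUS_RE.match(line)
    if match is None:
        raise ValueError(f"not a status line: {line!r}")
    job = match.group(1)
    letter = chr(int(match.group(2)))  # e.g. 82 -> 'R'
    return job, letter, STATUS_MEANING.get(letter, "unknown")
```

The job name used in any call is made up for illustration. Real job names are generated by CreateUniqueJobName().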

Local copy or migrate job

The local copy reads records from one volume and writes them to another volume on the same Bareos Storage Daemon. None of the data is transferred over the network.

When such a job is run, the Bareos Director connects to the Bareos Storage Daemon and tells it what data to read from which volume and to which volume the records should be written.

Overview local copy job

@startuml
participant d as "Director"
participant s as "Storage Daemon"
d -> s : authenticate
d -> s : send plugin options
alt if rescheduling
  d -> s : cancel previous job
end
d -> s : setup job
d -> s : reserve device for read
d -> s : reserve device for write
d -> s : start job
== Message thread for SD communication spawned ==
s -> d : jobstatus: running
s -> d : dequeue messages
s -> d : send current jobstatus
s -> d : tell dir that job has finished
== Message thread for SD communication exits ==
s -> d : BNET_EOD
@enduml

Detailed View local copy job

@startuml
participant d as "Director"
participant s as "Storage Daemon"
group Initiate dir to sd connection
  d -> s : Hello
  s -> d : CRAM-MD5 Challenge
  d -> s : CRAM-MD5 Response
  alt success
    s -> d : 1000 OK auth
  else failure
    s -> d : 1999 Authorization failed.
  end
  d -> s : CRAM-MD5 Challenge
  s -> d : CRAM-MD5 Response
  alt success
    d -> s : 1000 OK auth
  else failure
    d -> s : 1999 Authorization failed.
  end
end
loop each option
  d -> s : pluginoptions %s
  s -> d : 2000 OK plugin options
end
alt if rescheduling
  d -> s : cancel Job=%s
  s -> d : 3000 JobId=%ld Job="%s" marked to be %s.
end
d -> s : JobId=%s [...]
s -> d : 3000 OK Job SDid=%d SDtime=%d Authorization=%100s
d -> s : getSecureEraseCmd
s -> d : 2000 OK SDSecureEraseCmd %s
loop each read_storage
  d -> s : use storage=%s media_type=%s pool_name=%s pool_type=%s append=0 copy=%d stripe=%d
  loop each device
    d -> s : use device=%s
  end
  d -> s : BNET_EOD
end
d -> s : BNET_EOD
s -> d : 3000 OK use device device=%s
loop each write_storage
  d -> s : use storage=%s media_type=%s pool_name=%s pool_type=%s append=1 copy=%d stripe=%d
  loop each device
    d -> s : use device=%s
  end
  d -> s : BNET_EOD
end
d -> s : BNET_EOD
s -> d : 3000 OK use device device=%s
d -> s : run
' done till here
== Message thread for SD communication spawned ==
s -> d : Status Job=%s JobStatus=82
note right
  82 is numeric for 'R' which
  means running
end note

loop each queued message
  s -> d : Jmsg Job=%s type=%d level=%lld %s
end
s -> d : Status Job=%s JobStatus=%d
note right
  the current job status is transmitted
end note
s -> d : 3099 Job %s end JobStatus=%d JobFiles=%d JobBytes=%s JobErrors=%u
== Message thread for SD communication exits ==
s -> d : BNET_EOD
@enduml
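
Every "Initiate ... connection" group in the detailed views is a mutual challenge/response exchange: first one side proves that it knows the shared secret, then the roles are swapped. The sketch below shows the general CRAM-MD5 idea (an HMAC-MD5 over the challenge, keyed with the shared secret) using Python's standard library. All function names are hypothetical, and the challenge format and response encoding are assumptions, so the actual bytes on the wire between the daemons may differ.

```python
import base64
import hashlib
import hmac
import os


def make_challenge(hostname: str = "sd.example") -> bytes:
    # Assumption: a random nonce plus a host name, as in generic CRAM-MD5.
    return f"<{os.urandom(8).hex()}@{hostname}>".encode()


def respond(challenge: bytes, secret: bytes) -> bytes:
    # The response is the HMAC-MD5 of the challenge, keyed with the secret.
    digest = hmac.new(secret, challenge, hashlib.md5).digest()
    return base64.b64encode(digest)


def verify(challenge: bytes, response: bytes, secret: bytes) -> str:
    ok = hmac.compare_digest(respond(challenge, secret), response)
    return "1000 OK auth" if ok else "1999 Authorization failed."


def mutual_auth(caller_secret: bytes, callee_secret: bytes) -> bool:
    # Step 1: the callee challenges the caller.
    c1 = make_challenge()
    if verify(c1, respond(c1, caller_secret), callee_secret) != "1000 OK auth":
        return False
    # Step 2: the caller challenges the callee.
    c2 = make_challenge()
    return verify(c2, respond(c2, callee_secret), caller_secret) == "1000 OK auth"
```

Because both directions are checked, neither side can complete the handshake without knowing the secret configured on the other side.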

Remote copy or migrate job

The remote copy or migrate job reads records from one volume and writes them to another volume on a different Bareos Storage Daemon. From a networking perspective, copy and migrate are practically indistinguishable. The main difference is what the Bareos Director writes to the catalog after the job has finished.

When such a remote copy or migrate job is run, the Bareos Director connects to the reading Bareos Storage Daemon and then to the writing Bareos Storage Daemon. The writing Bareos Storage Daemon is put into listen mode, while the reading Bareos Storage Daemon essentially runs a restore whose data is sent to the writing Bareos Storage Daemon.

Overview remote copy job

@startuml
participant d as "Director"
participant r as "Read SD"
participant w as "Write SD"
d -> r : authenticate read sd
d -> w : authenticate write sd

d -> r : send plugin options
alt if rescheduling
  d -> r : cancel previous job
end

d -> r : setup job
d -> r : send bootstrap
d -> r : reserve device for read

d -> w : send plugin options
alt if rescheduling
  d -> w : cancel previous job
end
d -> w : setup job
d -> w : reserve device for write

d -> w : switch to listen mode
== Message thread for Write SD communication spawned ==
w -> d : jobstatus: waiting for storage daemon

d -> r : replicate command
r -> w : authenticate sd to sd

w -> d : jobstatus : running

d -> r : start job
== Message thread for Read SD communication spawned ==
r -> d : jobstatus: running

w -> d : jobstatus: running
r -> w : replicate data

r -> d : send jobstatus
r -> d : dequeue messages
r -> d : tell dir that job has finished
== Message thread for Read SD communication exits ==

w -> d : dequeue messages
w -> d : tell dir that job has finished
== Message thread for Write SD communication exits ==
@enduml

Detailed View remote copy job

@startuml
participant d as "Director"
participant r as "Read SD"
participant w as "Write SD"
group Initiate dir to read sd connection
  d -> r : Hello
  r -> d : CRAM-MD5 Challenge
  d -> r : CRAM-MD5 Response
  alt success
    r -> d : 1000 OK auth
  else failure
    r -> d : 1999 Authorization failed.
  end
  d -> r : CRAM-MD5 Challenge
  r -> d : CRAM-MD5 Response
  alt success
    d -> r : 1000 OK auth
  else failure
    d -> r : 1999 Authorization failed.
  end
end
group Initiate dir to write sd connection
  d -> w : Hello
  w -> d : CRAM-MD5 Challenge
  d -> w : CRAM-MD5 Response
  alt success
    w -> d : 1000 OK auth
  else failure
    w -> d : 1999 Authorization failed.
  end
  d -> w : CRAM-MD5 Challenge
  w -> d : CRAM-MD5 Response
  alt success
    d -> w : 1000 OK auth
  else failure
    d -> w : 1999 Authorization failed.
  end
end

loop each option
  d -> r : pluginoptions %s
  r -> d : 2000 OK plugin options
end
alt if rescheduling
  d -> r : cancel Job=%s
  r -> d : 3000 JobId=%ld Job="%s" marked to be %s.
end
d -> r : JobId=%s [...]
r -> d : 3000 OK Job SDid=%d SDtime=%d Authorization=%100s
d -> r : bootstrap
d -> r : <bootstrap content>
d -> r : BNET_EOD
r -> d : 2000 OK bootstrap
d -> r : getSecureEraseCmd
r -> d : 2000 OK SDSecureEraseCmd %s
loop each read_storage
  d -> r : use storage=%s media_type=%s pool_name=%s pool_type=%s append=0 copy=%d stripe=%d
  loop each device
    d -> r : use device=%s
  end
  d -> r : BNET_EOD
end
d -> r : BNET_EOD
r -> d : 3000 OK use device device=%s


loop each option
  d -> w : pluginoptions %s
  w -> d : 2000 OK plugin options
end
alt if rescheduling
  d -> w : cancel Job=%s
  w -> d : 3000 JobId=%ld Job="%s" marked to be %s.
end
d -> w : JobId=%s [...]
w -> d : 3000 OK Job SDid=%d SDtime=%d Authorization=%100s
d -> w : getSecureEraseCmd
w -> d : 2000 OK SDSecureEraseCmd %s
loop each write_storage
  d -> w : use storage=%s media_type=%s pool_name=%s pool_type=%s append=1 copy=%d stripe=%d
  loop each device
    d -> w : use device=%s
  end
  d -> w : BNET_EOD
end
d -> w : BNET_EOD
w -> d : 3000 OK use device device=%s

d -> r : setbandwidth=%d Job=%s
r -> d : 2000 OK Bandwidth
d -> w : listen
== Message thread for Write SD communication spawned ==
w -> d : Status Job=%s JobStatus=83
note right
  83 is numeric for 'S' which
  means waiting for storage daemon
end note
d -> r : replicate JobId=%d Job=%s address=<write sd> port=%d ssl=%d Authorization=%s

group Initiate read sd to write sd connection
  r -> w : Hello Start Storage Job %s
  w -> r : CRAM-MD5 Challenge
  r -> w : CRAM-MD5 Response
  alt success
    w -> r : 1000 OK auth
  else failure
    w -> r : 1999 Authorization failed.
  end
  r -> w : CRAM-MD5 Challenge
  w -> r : CRAM-MD5 Response
  alt success
    r -> w : 1000 OK auth
  else failure
    r -> w : 1999 Authorization failed.
  end
end
r -> d : 3000 OK replicate

w -> d : 3010 Job %s start
w -> d : Status Job=%s JobStatus=82
note right
  82 is numeric for 'R' which
  means running
end note

d -> r : run
== Message thread for Read SD communication spawned ==
r -> d : Status Job=%s JobStatus=82
note right
  82 is numeric for 'R' which
  means running
end note

r -> w : start replicate
w -> r : 3000 OK start replicate
r -> w : replicate data %d
w -> d : Status Job=%s JobStatus=82
note right
  82 is numeric for 'R' which
  means running
end note
w -> r : 3000 OK data

r -> w : DataRecords

r -> w : BNET_EOD
r -> w : BNET_EOD
r -> w : end replicate
w -> r : 3000 OK end replicate



r -> d : Status Job=%s JobStatus=%d
note right
  send current job status
end note
loop each queued message
  r -> d : Jmsg Job=%s type=%d level=%lld %s
end
r -> d : Status Job=%s JobStatus=84
note right
  84 is numeric for 'T' which
  means terminated
end note
r -> d : 3099 Job %s end JobStatus=%d JobFiles=%d JobBytes=%s JobErrors=%u
== Message thread for Read SD communication exits ==
r -> d : BNET_EOD


loop each queued message
  w -> d : Jmsg Job=%s type=%d level=%lld %s
end
w -> d : Status Job=%s JobStatus=84
note right
  84 is numeric for 'T' which
  means terminated
end note
w -> d : 3099 Job %s end JobStatus=%d JobFiles=%d JobBytes=%s JobErrors=%u
== Message thread for Write SD communication exits ==
w -> d : BNET_EOD
@enduml
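
Both Storage Daemons finish with a "3099 Job ... end" message whose fields follow the format string shown in the diagram. A minimal sketch of splitting such a line into its fields could look like this. The function name and the sample values are hypothetical. JobBytes is kept as a string because the format string uses %s for it.

```python
import re
from typing import Dict, Union

END_RE = re.compile(
    r"^3099 Job (\S+) end JobStatus=(\d+) JobFiles=(\d+) "
    r"JobBytes=(\S+) JobErrors=(\d+)$"
)


def parse_job_end(line: str) -> Dict[str, Union[str, int]]:
    """Split a '3099 Job ... end' message into its fields."""
    match = END_RE.match(line)
    if match is None:
        raise ValueError(f"not a job end message: {line!r}")
    job, status, files, job_bytes, errors = match.groups()
    return {
        "job": job,
        "status": chr(int(status)),  # numeric code back to the status letter
        "files": int(files),
        "bytes": job_bytes,          # left as text, see lead-in
        "errors": int(errors),
    }
```

For a line such as "3099 Job copy.1 end JobStatus=84 JobFiles=12 JobBytes=4096 JobErrors=0", the status field decodes to 'T'.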