TensorFlow models workflow

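Introduction

Model Garden for TensorFlow is a repository that contains implementations of many state-of-the-art (SOTA) models and modeling solutions for TensorFlow users. It aims to demonstrate modeling best practices so that TensorFlow users can take full advantage of TensorFlow for research and product development.

The code is organized into three parts:

  • official : a collection of example implementations of SOTA models built with the latest TensorFlow 2 high-level APIs. It is maintained and supported by the TensorFlow team, kept up to date with the latest TensorFlow 2 API, and reasonably optimized for fast performance while remaining easy to read;
  • research : a collection of research model implementations in TensorFlow 1 or 2, maintained and supported by the researchers themselves;
  • community : a curated list of GitHub repositories with machine learning models and implementations powered by TensorFlow 2.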

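Usage

Using official

Find the example in official that is closest to your own project and follow its installation and usage instructions. The official documentation is detailed, so it is not repeated here.

Using research

Object detection is used as the example here.

Environment setup

Install the following dependencies: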

Protobuf 3.0.0
Python-tk
Pillow 1.0
lxml
tf-slim (https://github.com/google-research/tf-slim)
slim (which is included in the "tensorflow/models/research/" checkout)
Jupyter notebook
Matplotlib
Tensorflow (1.15.0)
Cython
contextlib2
cocoapi
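
Before moving on, it can help to confirm which of the dependencies listed above are already installed. The following is a minimal sketch, not part of the original instructions; the import names in `REQUIRED_MODULES` are assumptions (the usual module names for these packages) and may differ on some systems.

```python
import importlib.util

# Assumed mapping from the package labels above to their import names.
REQUIRED_MODULES = {
    "Protobuf": "google.protobuf",
    "Python-tk": "tkinter",
    "Pillow": "PIL",
    "lxml": "lxml",
    "tf-slim": "tf_slim",
    "Jupyter notebook": "notebook",
    "Matplotlib": "matplotlib",
    "Tensorflow": "tensorflow",
    "Cython": "Cython",
    "contextlib2": "contextlib2",
    "cocoapi": "pycocotools",
}


def is_importable(module_name):
    """Return True if module_name can be located on the current Python path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package such as "google" is missing.
        return False


def check_dependencies(modules=REQUIRED_MODULES):
    """Map each package label to whether its module was found."""
    return {label: is_importable(name) for label, name in modules.items()}


for label, found in check_dependencies().items():
    print(f"{label:18s} {'ok' if found else 'MISSING'}")
```

The check only reports whether a module can be found; it does not verify versions.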

COCO API

To use the COCO evaluation metrics, install the COCO API:

git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cp -r pycocotools <path_to_tensorflow>/models/research/

Alternatively, install it directly with pip:

pip install --user pycocotools

Protobuf

# From tensorflow/models/research/
protoc object_detection/protos/*.proto --python_out=.
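
The command above relies on the shell expanding `*.proto`. Where that expansion is not available, the same call can be assembled from Python. This is a hedged sketch: the helper names `build_protoc_command` and `compile_protos` are hypothetical, and it assumes `protoc` is on the PATH.

```python
import glob
import os
import subprocess


def build_protoc_command(research_dir):
    """Build the protoc argument list for all object_detection .proto files."""
    pattern = os.path.join(research_dir, "object_detection", "protos", "*.proto")
    proto_files = sorted(glob.glob(pattern))
    # Paths are made relative to research_dir because protoc is run from there.
    relative = [os.path.relpath(p, research_dir) for p in proto_files]
    return ["protoc", *relative, "--python_out=."]


def compile_protos(research_dir):
    """Run protoc from research_dir; raises CalledProcessError on failure."""
    command = build_protoc_command(research_dir)
    if len(command) == 2:
        raise FileNotFoundError("no .proto files found under " + research_dir)
    subprocess.run(command, cwd=research_dir, check=True)
```

Calling `compile_protos("<path_to_tensorflow>/models/research")` would then have the same effect as the shell command.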

Set PYTHONPATH

# From tensorflow/models/research/
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
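
When working inside an already running interpreter (for example a Jupyter notebook), the `export` above has no effect on the current session. A rough equivalent, assuming a hypothetical helper named `add_research_paths`, is to extend `sys.path` directly:

```python
import os
import sys


def add_research_paths(research_dir, path_list=None):
    """Append research_dir and research_dir/slim to path_list if missing.

    path_list defaults to sys.path. Returns the entries that were added.
    """
    if path_list is None:
        path_list = sys.path
    base = os.path.abspath(research_dir)
    added = []
    for entry in (base, os.path.join(base, "slim")):
        if entry not in path_list:
            path_list.append(entry)
            added.append(entry)
    return added
```

This only changes the current Python process; scripts launched separately still need the PYTHONPATH environment variable.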

Verify the installation

# If using Tensorflow 1.X:
python object_detection/builders/model_builder_tf1_test.py

Recommended directory structure for training and evaluation

+data
  -label_map file
  -train TFRecord file
  -eval TFRecord file
+models
  + model
    -pipeline config file
    +train
    +eval
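
The directories in this layout can be created up front with a few lines of Python. This is a minimal sketch; `create_layout` and its `model_name` parameter are hypothetical names, and the files themselves (label map, TFRecord files, pipeline config) still have to be placed by hand.

```python
from pathlib import Path


def create_layout(root, model_name="model"):
    """Create data/, models/<model_name>/train and models/<model_name>/eval."""
    root = Path(root)
    dirs = [
        root / "data",
        root / "models" / model_name / "train",
        root / "models" / model_name / "eval",
    ]
    for directory in dirs:
        # parents=True also creates models/ and models/<model_name>/.
        directory.mkdir(parents=True, exist_ok=True)
    return dirs
```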

Training

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1

python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr

where ${PIPELINE_CONFIG_PATH} points to the pipeline config and ${MODEL_DIR} points to the directory to which training checkpoints and events will be written. Note that this binary interleaves training and evaluation.

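  • Sample configuration files for object detection are located under models/research/object_detection/samples/configs/;

  • If the number of saved checkpoints is not sufficient, you need to modify models/research/object_detection/model_main.py yourself: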

    # Copyright 2017 The TensorFlow Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # ==============================================================================
    """Binary to run train and evaluation on object detection model."""

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    from absl import flags

    import tensorflow.compat.v1 as tf

    from object_detection import model_hparams
    from object_detection import model_lib

    flags.DEFINE_string(
        'model_dir', None, 'Path to output model directory '
        'where event and checkpoint files will be written.')
    flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config '
                        'file.')
    flags.DEFINE_integer('num_train_steps', None, 'Number of train steps.')
    flags.DEFINE_boolean('eval_training_data', False,
                         'If training data should be evaluated for this job. Note '
                         'that one call only use this in eval-only mode, and '
                         '`checkpoint_dir` must be supplied.')
    flags.DEFINE_integer('sample_1_of_n_eval_examples', 1, 'Will sample one of '
                         'every n eval input examples, where n is provided.')
    flags.DEFINE_integer('sample_1_of_n_eval_on_train_examples', 5, 'Will sample '
                         'one of every n train input examples for evaluation, '
                         'where n is provided. This is only used if '
                         '`eval_training_data` is True.')
    flags.DEFINE_string(
        'hparams_overrides', None, 'Hyperparameter overrides, '
        'represented as a string containing comma-separated '
        'hparam_name=value pairs.')
    flags.DEFINE_string(
        'checkpoint_dir', None, 'Path to directory holding a checkpoint. If '
        '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
        'writing resulting metrics to `model_dir`.')
    flags.DEFINE_boolean(
        'run_once', False, 'If running in eval-only mode, whether to run just '
        'one round of eval vs running continuously (default).'
    )
    flags.DEFINE_integer(
        'max_eval_retries', 0, 'If running continuous eval, the maximum number of '
        'retries upon encountering tf.errors.InvalidArgumentError. If negative, '
        'will always retry the evaluation.'
    )
    FLAGS = flags.FLAGS


    def main(unused_argv):
      flags.mark_flag_as_required('model_dir')
      flags.mark_flag_as_required('pipeline_config_path')
      # Adjust the RunConfig arguments here to suit your setup.
      config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)

      train_and_eval_dict = model_lib.create_estimator_and_inputs(
          run_config=config,
          hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
          pipeline_config_path=FLAGS.pipeline_config_path,
          train_steps=FLAGS.num_train_steps,
          sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
          sample_1_of_n_eval_on_train_examples=(
              FLAGS.sample_1_of_n_eval_on_train_examples))
      estimator = train_and_eval_dict['estimator']
      train_input_fn = train_and_eval_dict['train_input_fn']
      eval_input_fns = train_and_eval_dict['eval_input_fns']
      eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
      predict_input_fn = train_and_eval_dict['predict_input_fn']
      train_steps = train_and_eval_dict['train_steps']

      if FLAGS.checkpoint_dir:
        if FLAGS.eval_training_data:
          name = 'training_data'
          input_fn = eval_on_train_input_fn
        else:
          name = 'validation_data'
          # The first eval input will be evaluated.
          input_fn = eval_input_fns[0]
        if FLAGS.run_once:
          estimator.evaluate(input_fn,
                             steps=None,
                             checkpoint_path=tf.train.latest_checkpoint(
                                 FLAGS.checkpoint_dir))
        else:
          model_lib.continuous_eval(estimator, FLAGS.checkpoint_dir, input_fn,
                                    train_steps, name, FLAGS.max_eval_retries)
      else:
        train_spec, eval_specs = model_lib.create_train_and_eval_specs(
            train_input_fn,
            eval_input_fns,
            eval_on_train_input_fn,
            predict_input_fn,
            train_steps,
            eval_on_train_data=False)

        # Currently only a single Eval Spec is allowed.
        tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])


    if __name__ == '__main__':
      tf.app.run()

    The tf.estimator.RunConfig class is defined as follows:

    class RunConfig(object):
      """This class specifies the configurations for an `Estimator` run."""

      def __init__(self,
                   model_dir=None,
                   tf_random_seed=None,
                   save_summary_steps=100,
                   save_checkpoints_steps=_USE_DEFAULT,
                   save_checkpoints_secs=_USE_DEFAULT,
                   session_config=None,
                   keep_checkpoint_max=5,
                   keep_checkpoint_every_n_hours=10000,
                   log_step_count_steps=100,
                   train_distribute=None,
                   device_fn=None,
                   protocol=None,
                   eval_distribute=None,
                   experimental_distribute=None,
                   experimental_max_worker_delay_secs=None,
                   session_creation_timeout_secs=7200):
        """Constructs a RunConfig.

        All distributed training related properties `cluster_spec`, `is_chief`,
        `master` , `num_worker_replicas`, `num_ps_replicas`, `task_id`, and
        `task_type` are set based on the `TF_CONFIG` environment variable, if the
        pertinent information is present. The `TF_CONFIG` environment variable is a
        JSON object with attributes: `cluster` and `task`.

        `cluster` is a JSON serialized version of `ClusterSpec`'s Python dict from
        `server_lib.py`, mapping task types (usually one of the `TaskType` enums) to
        a list of task addresses.

        `task` has two attributes: `type` and `index`, where `type` can be any of
        the task types in `cluster`. When `TF_CONFIG` contains said information,
        the following properties are set on this class:

        * `cluster_spec` is parsed from `TF_CONFIG['cluster']`. Defaults to {}. If
          present, must have one and only one node in the `chief` attribute of
          `cluster_spec`.
        * `task_type` is set to `TF_CONFIG['task']['type']`. Must set if
          `cluster_spec` is present; must be `worker` (the default value) if
          `cluster_spec` is not set.
        * `task_id` is set to `TF_CONFIG['task']['index']`. Must set if
          `cluster_spec` is present; must be 0 (the default value) if
          `cluster_spec` is not set.
        * `master` is determined by looking up `task_type` and `task_id` in the
          `cluster_spec`. Defaults to ''.
        * `num_ps_replicas` is set by counting the number of nodes listed
          in the `ps` attribute of `cluster_spec`. Defaults to 0.
        * `num_worker_replicas` is set by counting the number of nodes listed
          in the `worker` and `chief` attributes of `cluster_spec`. Defaults to 1.
        * `is_chief` is determined based on `task_type` and `cluster`.

        There is a special node with `task_type` as `evaluator`, which is not part
        of the (training) `cluster_spec`. It handles the distributed evaluation job.

        Example of non-chief node:

          cluster = {'chief': ['host0:2222'],
                     'ps': ['host1:2222', 'host2:2222'],
                     'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
          os.environ['TF_CONFIG'] = json.dumps(
              {'cluster': cluster,
               'task': {'type': 'worker', 'index': 1}})
          config = RunConfig()
          assert config.master == 'host4:2222'
          assert config.task_id == 1
          assert config.num_ps_replicas == 2
          assert config.num_worker_replicas == 4
          assert config.cluster_spec == server_lib.ClusterSpec(cluster)
          assert config.task_type == 'worker'
          assert not config.is_chief

        Example of chief node:

          cluster = {'chief': ['host0:2222'],
                     'ps': ['host1:2222', 'host2:2222'],
                     'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
          os.environ['TF_CONFIG'] = json.dumps(
              {'cluster': cluster,
               'task': {'type': 'chief', 'index': 0}})
          config = RunConfig()
          assert config.master == 'host0:2222'
          assert config.task_id == 0
          assert config.num_ps_replicas == 2
          assert config.num_worker_replicas == 4
          assert config.cluster_spec == server_lib.ClusterSpec(cluster)
          assert config.task_type == 'chief'
          assert config.is_chief


        Example of evaluator node (evaluator is not part of training cluster):

          cluster = {'chief': ['host0:2222'],
                     'ps': ['host1:2222', 'host2:2222'],
                     'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
          os.environ['TF_CONFIG'] = json.dumps(
              {'cluster': cluster,
               'task': {'type': 'evaluator', 'index': 0}})
          config = RunConfig()
          assert config.master == ''
          assert config.evaluator_master == ''
          assert config.task_id == 0
          assert config.num_ps_replicas == 0
          assert config.num_worker_replicas == 0
          assert config.cluster_spec == {}
          assert config.task_type == 'evaluator'
          assert not config.is_chief


        N.B.: If `save_checkpoints_steps` or `save_checkpoints_secs` is set,
        `keep_checkpoint_max` might need to be adjusted accordingly, especially in
        distributed training. For example, setting `save_checkpoints_secs` as 60
        without adjusting `keep_checkpoint_max` (defaults to 5) leads to situation
        that checkpoint would be garbage collected after 5 minutes. In distributed
        training, the evaluation job starts asynchronously and might fail to load or
        find the checkpoint due to race condition.

        Args:
          model_dir: directory where model parameters, graph, etc are saved. If
            `PathLike` object, the path will be resolved. If `None`, will use a
            default value set by the Estimator.
          tf_random_seed: Random seed for TensorFlow initializers.
            Setting this value allows consistency between reruns.
          save_summary_steps: Save summaries every this many steps.
          save_checkpoints_steps: Save checkpoints every this many steps. Can not be
            specified with `save_checkpoints_secs`.
          save_checkpoints_secs: Save checkpoints every this many seconds. Can not
            be specified with `save_checkpoints_steps`. Defaults to 600 seconds if
            both `save_checkpoints_steps` and `save_checkpoints_secs` are not set
            in constructor. If both `save_checkpoints_steps` and
            `save_checkpoints_secs` are `None`, then checkpoints are disabled.
          session_config: a ConfigProto used to set session parameters, or `None`.
          keep_checkpoint_max: The maximum number of recent checkpoint files to
            keep. As new files are created, older files are deleted. If `None` or 0,
            all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent
            checkpoint files are kept.)
          keep_checkpoint_every_n_hours: Number of hours between each checkpoint
            to be saved. The default value of 10,000 hours effectively disables
            the feature.
          log_step_count_steps: The frequency, in number of global steps, that the
            global step and the loss will be logged during training. Also controls
            the frequency that the global steps / s will be logged (and written to
            summary) during training.
          train_distribute: An optional instance of `tf.distribute.Strategy`.
            If specified, then Estimator will distribute the user's model during
            training, according to the policy specified by that strategy. Setting
            `experimental_distribute.train_distribute` is preferred.
          device_fn: A callable invoked for every `Operation` that takes the
            `Operation` and returns the device string. If `None`, defaults to
            the device function returned by `tf.train.replica_device_setter`
            with round-robin strategy.
          protocol: An optional argument which specifies the protocol used when
            starting server. `None` means default to grpc.
          eval_distribute: An optional instance of `tf.distribute.Strategy`.
            If specified, then Estimator will distribute the user's model during
            evaluation, according to the policy specified by that strategy.
            Setting `experimental_distribute.eval_distribute` is preferred.
          experimental_distribute: An optional
            `tf.contrib.distribute.DistributeConfig` object specifying
            DistributionStrategy-related configuration. The `train_distribute` and
            `eval_distribute` can be passed as parameters to `RunConfig` or set in
            `experimental_distribute` but not both.
          experimental_max_worker_delay_secs: An optional integer
            specifying the maximum time a worker should wait before starting.
            By default, workers are started at staggered times, with each worker
            being delayed by up to 60 seconds. This is intended to reduce the risk
            of divergence, which can occur when many workers simultaneously update
            the weights of a randomly initialized model. Users who warm-start their
            models and train them for short durations (a few minutes or less) should
            consider reducing this default to improve training times.
          session_creation_timeout_secs: Max time workers should wait for a session
            to become available (on initialization or when recovering a session)
            with MonitoredTrainingSession. Defaults to 7200 seconds, but users may
            want to set a lower value to detect problems with variable / session
            (re)-initialization more quickly.

        Raises:
          ValueError: If both `save_checkpoints_steps` and `save_checkpoints_secs`
            are set.
        """

In model_main.py, set keep_checkpoint_max=500:

config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, keep_checkpoint_max=500)

Then set max_to_keep=500 in two places in model_lib.py:

saver = tf.train.Saver(
    variables_to_restore,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    max_to_keep=500)  # <= added max_to_keep argument here


saver = tf.train.Saver(
    sharded=True,
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
    save_relative_paths=True,
    max_to_keep=500)  # <= added max_to_keep argument here
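
The effect of raising the limit can be illustrated with a small simulation of the retention rule quoted in the RunConfig docstring (only the most recent `keep_checkpoint_max` checkpoints are kept; `None` or 0 keeps all). This is a simplified sketch, not TensorFlow's implementation: the function names and the save interval of 1000 steps are assumptions, and `keep_checkpoint_every_n_hours` is ignored.

```python
def retained_checkpoints(saved_steps, keep_checkpoint_max=5):
    """Return the checkpoint steps still on disk after all saves.

    Only the most recent keep_checkpoint_max checkpoints are kept;
    None or 0 keeps everything.
    """
    steps = sorted(saved_steps)
    if not keep_checkpoint_max:
        return steps
    return steps[-keep_checkpoint_max:]


def retention_window_secs(save_checkpoints_secs, keep_checkpoint_max):
    """Approximate age at which a checkpoint is deleted with time-based saving."""
    return save_checkpoints_secs * keep_checkpoint_max


# Hypothetical run: one checkpoint every 1000 steps over 50000 steps.
saved = range(1000, 50001, 1000)
print(retained_checkpoints(saved, keep_checkpoint_max=5))
# → [46000, 47000, 48000, 49000, 50000]
print(len(retained_checkpoints(saved, keep_checkpoint_max=500)))
# → 50
print(retention_window_secs(60, 5))
# → 300 (the "5 minutes" mentioned in the docstring)
```

With the default of 5, only the last five checkpoints of such a run would survive; with 500, all 50 remain available for later evaluation or export.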

Using TensorBoard

Progress for training and eval jobs can be inspected using TensorBoard. If using the recommended directory structure, TensorBoard can be run using the following command:

tensorboard --logdir=${MODEL_DIR}

where ${MODEL_DIR} points to the directory that contains the train and eval directories. Please note that it may take TensorBoard a couple of minutes to populate with data.

Evaluating on the validation set

python object_detection/legacy/eval.py --checkpoint_dir /path/to/model  --eval_dir /path/to/store/eval_dir \
--pipeline_config_path ~/tensorflow_models/research/object_detection/samples/configs/ssd_mobilenet_v2_coco.config

Note
The number of evaluation examples configured in the .config file must match the number of samples in the validation set.
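
To find the number to put in the config, the records in the eval TFRecord file can be counted. The sketch below does this without TensorFlow by walking the TFRecord framing (8-byte length, 4-byte CRC, payload, 4-byte CRC). It assumes an uncompressed file with one example per record, does not verify the CRCs, and the function name `count_tfrecord_examples` is hypothetical.

```python
import os
import struct


def count_tfrecord_examples(path):
    """Count the records in an uncompressed TFRecord file."""
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            f.seek(4 + length + 4, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError("truncated record body")
            count += 1
    return count
```

If the eval set is split across several shards, the counts of all shards would need to be summed.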

Exporting a trained model for inference

After your model has been trained, you should export it to a Tensorflow
graph proto. A checkpoint will typically consist of three files:

  • model.ckpt-${CHECKPOINT_NUMBER}.data-00000-of-00001
  • model.ckpt-${CHECKPOINT_NUMBER}.index
  • model.ckpt-${CHECKPOINT_NUMBER}.meta

After you’ve identified a candidate checkpoint to export, run the following command from tensorflow/models/research:

    # From tensorflow/models/research/
    INPUT_TYPE=image_tensor
    PIPELINE_CONFIG_PATH={path to pipeline config file}
    TRAINED_CKPT_PREFIX={path to model.ckpt}
    EXPORT_DIR={path to folder that will be used for export}
    python object_detection/export_inference_graph.py \
    --input_type=${INPUT_TYPE} \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --trained_checkpoint_prefix=${TRAINED_CKPT_PREFIX} \
    --output_directory=${EXPORT_DIR}
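
To pick a value for TRAINED_CKPT_PREFIX, it can help to list the checkpoints in the model directory that have all three kinds of files described above. This is a minimal sketch; `find_complete_checkpoints` is a hypothetical helper and it assumes the `model.ckpt-<step>` naming shown in this section.

```python
import os
import re

_INDEX_RE = re.compile(r"^(model\.ckpt-(\d+))\.index$")


def find_complete_checkpoints(model_dir):
    """Return sorted (step, prefix) pairs for checkpoints with .index, .meta and .data-* files."""
    names = set(os.listdir(model_dir))
    found = []
    for name in names:
        match = _INDEX_RE.match(name)
        if not match:
            continue
        prefix, step = match.group(1), int(match.group(2))
        has_meta = prefix + ".meta" in names
        has_data = any(n.startswith(prefix + ".data-") for n in names)
        if has_meta and has_data:
            found.append((step, os.path.join(model_dir, prefix)))
    return sorted(found)
```

The prefix of the last entry returned is the most recent complete checkpoint and could be passed as `--trained_checkpoint_prefix`.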

NOTE: We are configuring our exported model to ingest 4-D image tensors. We can also configure the exported model to take encoded images or serialized tf.Examples.

After export, you should see the directory ${EXPORT_DIR} containing the following:

  • saved_model/, a directory containing the saved model format of the exported model
  • frozen_inference_graph.pb, the frozen graph format of the exported model
  • model.ckpt.*, the model checkpoints used for exporting
  • checkpoint, a file specifying which of the included checkpoint files to restore
  • pipeline.config, pipeline config file for the exported model
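
A quick way to confirm that an export produced the entries listed above is to check the directory contents. This is a rough sketch under the assumption that the names match the list exactly; `missing_export_artifacts` is a hypothetical helper.

```python
import os

# Entries expected directly inside the export directory, per the list above.
EXPECTED_ENTRIES = [
    "saved_model",
    "frozen_inference_graph.pb",
    "checkpoint",
    "pipeline.config",
]


def missing_export_artifacts(export_dir):
    """Return the expected export entries that are not present in export_dir."""
    names = os.listdir(export_dir)
    missing = [entry for entry in EXPECTED_ENTRIES if entry not in names]
    # The checkpoint itself is a group of files sharing the model.ckpt. prefix.
    if not any(name.startswith("model.ckpt.") for name in names):
        missing.append("model.ckpt.*")
    return missing
```

An empty list means every expected entry was found.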

References

  1. Welcome to the Model Garden for TensorFlow