Introduction
The Model Garden for TensorFlow is a repository containing implementations of many state-of-the-art (SOTA) models and modeling solutions for TensorFlow users. It aims to demonstrate best practices for modeling so that TensorFlow users can take full advantage of TensorFlow for research and product development.
The code is organized into three parts:

- official: a collection of example implementations of SOTA models built with the latest TensorFlow 2 high-level APIs. It is maintained and officially supported by TensorFlow, kept up to date with the latest TensorFlow 2 APIs, and reasonably optimized for fast performance while remaining easy to read.
- research: implementations of research models in TensorFlow 1 or 2, maintained and supported by the researchers who authored them.
- community: a curated list of GitHub repositories with machine learning models and implementations powered by TensorFlow 2.
Usage
Using official
For concrete usage, browse official for a reference implementation close to your own project and follow its installation and usage instructions; the official documentation is very thorough, so it is not repeated here.
Using research
This section walks through object detection as an example.
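The steps below assume the Model Garden repository has been cloned locally and that commands are run from models/research; a minimal setup sketch:

```
# Clone the Model Garden; subsequent commands assume this layout.
git clone https://github.com/tensorflow/models.git
cd models/research
```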
Environment setup
Install the following Python libraries:

```
Protobuf 3.0.0
Python-tk
Pillow 1.0
lxml
tf Slim (included in the models/research/ checkout)
Jupyter notebook
Matplotlib
Tensorflow
Cython
contextlib2
cocoapi
```
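Most of these can be installed with pip; a minimal sketch, assuming the usual PyPI package names (TensorFlow itself is installed separately):

```
pip install --user Cython contextlib2 pillow lxml jupyter matplotlib
```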
COCO API
To use the COCO evaluation metrics, install the COCO dataset API:

```
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cp -r pycocotools <path_to_tensorflow>/models/research/
```
Or simply run:

```
pip install --user pycocotools
```
Protobuf
The Object Detection API uses Protobufs to configure model and training parameters, so the proto libraries must be compiled first:

```
# From tensorflow/models/research/
protoc object_detection/protos/*.proto --python_out=.
```
Set PYTHONPATH
Append the research and slim directories to PYTHONPATH:

```
# From tensorflow/models/research/
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
```
Verify the installation

```
# If using Tensorflow 1.X:
python object_detection/builders/model_builder_tf1_test.py
# If using Tensorflow 2.X:
python object_detection/builders/model_builder_tf2_test.py
```
Recommended directory structure for training and evaluation

```
+data
  -label_map file
  -train TFRecord file
  -eval TFRecord file
+models
  + model
    -pipeline config file
    +train
    +eval
```
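One way to create this layout on disk, assuming the directory names above are kept verbatim:

```
mkdir -p data models/model/train models/model/eval
```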
Training

```
# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --num_train_steps=${NUM_TRAIN_STEPS} \
    --sample_1_of_n_eval_examples=${SAMPLE_1_OF_N_EVAL_EXAMPLES} \
    --alsologtostderr
```
where ${PIPELINE_CONFIG_PATH} points to the pipeline config and ${MODEL_DIR} points to the directory in which training checkpoints and events will be written. Note that this binary will interleave both training and evaluation.
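Because model_main.py interleaves training and evaluation, an eval-only run is also possible via the checkpoint_dir and run_once flags defined in the script below; a sketch:

```
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${MODEL_DIR} \
    --run_once=True
```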
Sample pipeline configs for object detection live under models/research/object_detection/samples/configs/. If the number of checkpoints kept does not meet your needs, modify models/research/object_detection/model_main.py yourself:

```python
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Binary to run train and evaluation on object detection model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl import flags
import tensorflow.compat.v1 as tf
from object_detection import model_hparams
from object_detection import model_lib
flags.DEFINE_string(
    'model_dir', None, 'Path to output model directory '
    'where event and checkpoint files will be written.')
flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config '
                    'file.')
flags.DEFINE_integer('num_train_steps', None, 'Number of train steps.')
flags.DEFINE_boolean('eval_training_data', False,
                     'If training data should be evaluated for this job. Note '
                     'that one call only use this in eval-only mode, and '
                     '`checkpoint_dir` must be supplied.')
flags.DEFINE_integer('sample_1_of_n_eval_examples', 1, 'Will sample one of '
                     'every n eval input examples, where n is provided.')
flags.DEFINE_integer('sample_1_of_n_eval_on_train_examples', 5, 'Will sample '
                     'one of every n train input examples for evaluation, '
                     'where n is provided. This is only used if '
                     '`eval_training_data` is True.')
flags.DEFINE_string(
    'hparams_overrides', None, 'Hyperparameter overrides, '
    'represented as a string containing comma-separated '
    'hparam_name=value pairs.')
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint. If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
flags.DEFINE_boolean(
    'run_once', False, 'If running in eval-only mode, whether to run just '
    'one round of eval vs running continuously (default).'
)
flags.DEFINE_integer(
    'max_eval_retries', 0, 'If running continuous eval, the maximum number of '
    'retries upon encountering tf.errors.InvalidArgumentError. If negative, '
    'will always retry the evaluation.'
)
FLAGS = flags.FLAGS


def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')
  # Adjust the RunConfig arguments here as needed for your setup.
  config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
  train_and_eval_dict = model_lib.create_estimator_and_inputs(
      run_config=config,
      hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
      pipeline_config_path=FLAGS.pipeline_config_path,
      train_steps=FLAGS.num_train_steps,
      sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
      sample_1_of_n_eval_on_train_examples=(
          FLAGS.sample_1_of_n_eval_on_train_examples))
  estimator = train_and_eval_dict['estimator']
  train_input_fn = train_and_eval_dict['train_input_fn']
  eval_input_fns = train_and_eval_dict['eval_input_fns']
  eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
  predict_input_fn = train_and_eval_dict['predict_input_fn']
  train_steps = train_and_eval_dict['train_steps']

  if FLAGS.checkpoint_dir:
    if FLAGS.eval_training_data:
      name = 'training_data'
      input_fn = eval_on_train_input_fn
    else:
      name = 'validation_data'
      # The first eval input will be evaluated.
      input_fn = eval_input_fns[0]
    if FLAGS.run_once:
      estimator.evaluate(input_fn,
                         steps=None,
                         checkpoint_path=tf.train.latest_checkpoint(
                             FLAGS.checkpoint_dir))
    else:
      model_lib.continuous_eval(estimator, FLAGS.checkpoint_dir, input_fn,
                                train_steps, name, FLAGS.max_eval_retries)
  else:
    train_spec, eval_specs = model_lib.create_train_and_eval_specs(
        train_input_fn,
        eval_input_fns,
        eval_on_train_input_fn,
        predict_input_fn,
        train_steps,
        eval_on_train_data=False)

    # Currently only a single Eval Spec is allowed.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
if __name__ == '__main__':
  tf.app.run()
```

The tf.estimator.RunConfig constructor used above is defined as follows:

```python
class RunConfig(object):
"""This class specifies the configurations for an `Estimator` run."""
def __init__(self,
model_dir=None,
tf_random_seed=None,
save_summary_steps=100,
save_checkpoints_steps=_USE_DEFAULT,
save_checkpoints_secs=_USE_DEFAULT,
session_config=None,
keep_checkpoint_max=5,
keep_checkpoint_every_n_hours=10000,
log_step_count_steps=100,
train_distribute=None,
device_fn=None,
protocol=None,
eval_distribute=None,
experimental_distribute=None,
experimental_max_worker_delay_secs=None,
session_creation_timeout_secs=7200):
"""Constructs a RunConfig.
All distributed training related properties `cluster_spec`, `is_chief`,
`master` , `num_worker_replicas`, `num_ps_replicas`, `task_id`, and
`task_type` are set based on the `TF_CONFIG` environment variable, if the
pertinent information is present. The `TF_CONFIG` environment variable is a
JSON object with attributes: `cluster` and `task`.
`cluster` is a JSON serialized version of `ClusterSpec`'s Python dict from
`server_lib.py`, mapping task types (usually one of the `TaskType` enums) to
a list of task addresses.
`task` has two attributes: `type` and `index`, where `type` can be any of
the task types in `cluster`. When `TF_CONFIG` contains said information,
the following properties are set on this class:
* `cluster_spec` is parsed from `TF_CONFIG['cluster']`. Defaults to {}. If
present, must have one and only one node in the `chief` attribute of
`cluster_spec`.
* `task_type` is set to `TF_CONFIG['task']['type']`. Must set if
`cluster_spec` is present; must be `worker` (the default value) if
`cluster_spec` is not set.
* `task_id` is set to `TF_CONFIG['task']['index']`. Must set if
`cluster_spec` is present; must be 0 (the default value) if
`cluster_spec` is not set.
* `master` is determined by looking up `task_type` and `task_id` in the
`cluster_spec`. Defaults to ''.
* `num_ps_replicas` is set by counting the number of nodes listed
in the `ps` attribute of `cluster_spec`. Defaults to 0.
* `num_worker_replicas` is set by counting the number of nodes listed
in the `worker` and `chief` attributes of `cluster_spec`. Defaults to 1.
* `is_chief` is determined based on `task_type` and `cluster`.
There is a special node with `task_type` as `evaluator`, which is not part
of the (training) `cluster_spec`. It handles the distributed evaluation job.
Example of non-chief node:
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'worker', 'index': 1}})
config = RunConfig()
assert config.master == 'host4:2222'
assert config.task_id == 1
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'worker'
assert not config.is_chief
Example of chief node:
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'chief', 'index': 0}})
config = RunConfig()
assert config.master == 'host0:2222'
assert config.task_id == 0
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'chief'
assert config.is_chief
Example of evaluator node (evaluator is not part of training cluster):
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'evaluator', 'index': 0}})
config = RunConfig()
assert config.master == ''
assert config.evaluator_master == ''
assert config.task_id == 0
assert config.num_ps_replicas == 0
assert config.num_worker_replicas == 0
assert config.cluster_spec == {}
assert config.task_type == 'evaluator'
assert not config.is_chief
N.B.: If `save_checkpoints_steps` or `save_checkpoints_secs` is set,
`keep_checkpoint_max` might need to be adjusted accordingly, especially in
distributed training. For example, setting `save_checkpoints_secs` as 60
without adjusting `keep_checkpoint_max` (defaults to 5) leads to situation
that checkpoint would be garbage collected after 5 minutes. In distributed
training, the evaluation job starts asynchronously and might fail to load or
find the checkpoint due to race condition.
Args:
model_dir: directory where model parameters, graph, etc are saved. If
`PathLike` object, the path will be resolved. If `None`, will use a
default value set by the Estimator.
tf_random_seed: Random seed for TensorFlow initializers.
Setting this value allows consistency between reruns.
save_summary_steps: Save summaries every this many steps.
save_checkpoints_steps: Save checkpoints every this many steps. Can not be
specified with `save_checkpoints_secs`.
save_checkpoints_secs: Save checkpoints every this many seconds. Can not
be specified with `save_checkpoints_steps`. Defaults to 600 seconds if
both `save_checkpoints_steps` and `save_checkpoints_secs` are not set
in constructor. If both `save_checkpoints_steps` and
`save_checkpoints_secs` are `None`, then checkpoints are disabled.
session_config: a ConfigProto used to set session parameters, or `None`.
keep_checkpoint_max: The maximum number of recent checkpoint files to
keep. As new files are created, older files are deleted. If `None` or 0,
all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent
checkpoint files are kept.)
keep_checkpoint_every_n_hours: Number of hours between each checkpoint
to be saved. The default value of 10,000 hours effectively disables
the feature.
log_step_count_steps: The frequency, in number of global steps, that the
global step and the loss will be logged during training. Also controls
the frequency that the global steps / s will be logged (and written to
summary) during training.
train_distribute: An optional instance of `tf.distribute.Strategy`.
If specified, then Estimator will distribute the user's model during
training, according to the policy specified by that strategy. Setting
`experimental_distribute.train_distribute` is preferred.
device_fn: A callable invoked for every `Operation` that takes the
`Operation` and returns the device string. If `None`, defaults to
the device function returned by `tf.train.replica_device_setter`
with round-robin strategy.
protocol: An optional argument which specifies the protocol used when
starting server. `None` means default to grpc.
eval_distribute: An optional instance of `tf.distribute.Strategy`.
If specified, then Estimator will distribute the user's model during
evaluation, according to the policy specified by that strategy.
Setting `experimental_distribute.eval_distribute` is preferred.
experimental_distribute: An optional
`tf.contrib.distribute.DistributeConfig` object specifying
DistributionStrategy-related configuration. The `train_distribute` and
`eval_distribute` can be passed as parameters to `RunConfig` or set in
`experimental_distribute` but not both.
experimental_max_worker_delay_secs: An optional integer
specifying the maximum time a worker should wait before starting.
By default, workers are started at staggered times, with each worker
being delayed by up to 60 seconds. This is intended to reduce the risk
of divergence, which can occur when many workers simultaneously update
the weights of a randomly initialized model. Users who warm-start their
models and train them for short durations (a few minutes or less) should
consider reducing this default to improve training times.
session_creation_timeout_secs: Max time workers should wait for a session
to become available (on initialization or when recovering a session)
with MonitoredTrainingSession. Defaults to 7200 seconds, but users may
want to set a lower value to detect problems with variable / session
(re)-initialization more quickly.
Raises:
ValueError: If both `save_checkpoints_steps` and `save_checkpoints_secs`
are set.
"""
In model_main.py, set keep_checkpoint_max=500:

```python
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, keep_checkpoint_max=500)
```
Then, in model_lib.py, set max_to_keep=500 in the two places where a tf.train.Saver is constructed. A sketch of one call site (the surrounding arguments follow model_lib.py; only max_to_keep is added):

```python
saver = tf.train.Saver(
    sharded=True,
    keep_checkpoint_every_n_hours=(
        train_config.keep_checkpoint_every_n_hours),
    save_relative_paths=True,
    max_to_keep=500)  # added: keep up to 500 checkpoints
```
Using TensorBoard
Progress for training and eval jobs can be inspected using TensorBoard. If using the recommended directory structure, TensorBoard can be run using the following command:
```
tensorboard --logdir=${MODEL_DIR}
```
where ${MODEL_DIR} points to the directory that contains the train and eval directories. Please note it may take TensorBoard a couple of minutes to populate with data.
Evaluating on the validation set

```
python object_detection/legacy/eval.py \
    --checkpoint_dir /path/to/model \
    --eval_dir /path/to/store/eval_dir \
    --pipeline_config_path /path/to/pipeline.config
```
Note that the evaluation sample count configured in the eval section of the .config file must match the size of your validation set.
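For illustration, a hypothetical eval_config excerpt from a pipeline .config file; the value 5000 is a placeholder for the actual number of validation images:

```
eval_config: {
  num_examples: 5000
  # other evaluation settings unchanged
}
```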
Exporting a trained model for inference
After your model has been trained, you should export it to a Tensorflow
graph proto. A checkpoint will typically consist of three files:
- model.ckpt-${CHECKPOINT_NUMBER}.data-00000-of-00001
- model.ckpt-${CHECKPOINT_NUMBER}.index
- model.ckpt-${CHECKPOINT_NUMBER}.meta
After you’ve identified a candidate checkpoint to export, run the following
command from tensorflow/models/research:

```
# From tensorflow/models/research/
INPUT_TYPE=image_tensor
PIPELINE_CONFIG_PATH={path to pipeline config file}
TRAINED_CKPT_PREFIX={path to model.ckpt}
EXPORT_DIR={path to folder that will be used for export}
python object_detection/export_inference_graph.py \
--input_type=${INPUT_TYPE} \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--trained_checkpoint_prefix=${TRAINED_CKPT_PREFIX} \
--output_directory=${EXPORT_DIR}
```

NOTE: We are configuring our exported model to ingest 4-D image tensors. We can also configure the exported model to take encoded images or serialized tf.Examples.
After export, you should see the directory ${EXPORT_DIR} containing the following:

- saved_model/, a directory containing the saved model format of the exported model
- frozen_inference_graph.pb, the frozen graph format of the exported model
- model.ckpt.*, the model checkpoints used for exporting
- checkpoint, a file specifying to restore included checkpoint files
- pipeline.config, pipeline config file for the exported model
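As a quick sanity check of the exported model, the sketch below (not from the original post) loads frozen_inference_graph.pb and runs detection on a dummy image; the tensor names are the standard ones produced by export_inference_graph.py, and the paths are placeholders:

```python
import numpy as np
import tensorflow.compat.v1 as tf

# Load the exported frozen graph (placeholder path).
graph_def = tf.GraphDef()
with tf.gfile.GFile('exported/frozen_inference_graph.pb', 'rb') as f:
  graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
  tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
  # Dummy 4-D uint8 image batch; replace with a real decoded image.
  image = np.zeros((1, 300, 300, 3), dtype=np.uint8)
  boxes, scores, classes, num = sess.run(
      ['detection_boxes:0', 'detection_scores:0',
       'detection_classes:0', 'num_detections:0'],
      feed_dict={'image_tensor:0': image})
  print('num detections:', int(num[0]))
```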