WMT16

class paddle.text.WMT16(data_file: str | None = None, mode: _Wmt16DataSetMode = 'train', src_dict_size: int = -1, trg_dict_size: int = -1, lang: _Wmt16Language = 'en', download: bool = True)

Implementation of the WMT16 dataset from the ACL 2016 Multimodal Machine Translation task. See the task website for more details: http://www.statmt.org/wmt16/multimodal-task.html#task1

If you use this dataset in your work, please cite the following paper: Multi30K: Multilingual English-German Image Descriptions.

@article{elliott-EtAl:2016:VL16,
 author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
 title     = {Multi30K: Multilingual English-German Image Descriptions},
 booktitle = {Proceedings of the 6th Workshop on Vision and Language},
 year      = {2016},
 pages     = {70--74}
}

Parameters

data_file (str|None) – Path to the dataset tar file. May be None if download is True. Default None.
mode (str) – Which split to load: ‘train’, ‘test’ or ‘val’. Default ‘train’.
src_dict_size (int) – Size of the source-language word dictionary. Default -1.
trg_dict_size (int) – Size of the target-language word dictionary. Default -1.
lang (str) – Source language, ‘en’ or ‘de’. Default ‘en’.
download (bool) – Whether to download the dataset automatically if data_file is not set. Default True.
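To illustrate what a bounded word dictionary size means in practice, here is a minimal pure-Python sketch of a frequency-truncated vocabulary. It is not the library's implementation: the helper names `build_dict` and `encode`, the toy corpus, and the three reserved tokens (`<s>`, `<e>`, `<unk>`) are assumptions made for this example.

```python
from collections import Counter

# Assumed reserved tokens for this sketch; the dataset's own entries may differ.
SPECIAL_TOKENS = ["<s>", "<e>", "<unk>"]


def build_dict(sentences, dict_size):
    """Map words to ids, keeping dict_size entries including the special tokens."""
    counts = Counter(word for sent in sentences for word in sent.split())
    # The special tokens take the first ids; the rest are filled by frequency.
    n_words = max(dict_size - len(SPECIAL_TOKENS), 0)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = SPECIAL_TOKENS + [word for word, _ in ranked[:n_words]]
    return {word: idx for idx, word in enumerate(vocab)}


def encode(sentence, word_dict):
    """Replace each word with its id, falling back to the <unk> id."""
    unk_id = word_dict["<unk>"]
    return [word_dict.get(word, unk_id) for word in sentence.split()]


corpus = ["a man rides a bike", "a dog runs", "a man runs"]
word_dict = build_dict(corpus, dict_size=6)
ids = encode("a man rides a bike", word_dict)
print(word_dict)
print(ids)
```

With a small dictionary size, rarer words such as "rides" and "bike" fall outside the vocabulary and map to the unknown-word id, which is the usual trade-off when choosing src_dict_size and trg_dict_size.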

Returns

An instance of the WMT16 dataset. Each sample in the dataset has three fields:

src_ids (np.array) - The sequence of token ids of the source-language sentence.
trg_ids (np.array) - The sequence of token ids of the target-language sentence.
trg_ids_next (np.array) - The target-language token ids shifted by one position, i.e. the next token for each position in trg_ids.
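The relationship between trg_ids and trg_ids_next can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the helper `make_target_pair` is hypothetical, and the start and end mark ids (0 and 1) are assumed values. They are at least consistent with the example output on this page, where the third printed sum is always one more than the second.

```python
START_ID = 0  # assumed id of the start-of-sequence mark
END_ID = 1    # assumed id of the end-of-sequence mark


def make_target_pair(target_token_ids):
    """Build the decoder input and its next-token labels from one sentence."""
    tokens = list(target_token_ids)
    trg_ids = [START_ID] + tokens      # decoder input: start mark, then the sentence
    trg_ids_next = tokens + [END_ID]   # labels: the sentence, then the end mark
    return trg_ids, trg_ids_next


trg_ids, trg_ids_next = make_target_pair([7, 12, 5])

# At every position t, trg_ids_next[t] is the token that follows trg_ids[t].
for current, following in zip(trg_ids, trg_ids_next):
    print(current, "->", following)
```

Pairing the two sequences this way lets a decoder be trained to predict each next token given the tokens before it.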

Return type

Dataset

Examples

>>> import paddle
>>> from paddle.text.datasets import WMT16

>>> class SimpleNet(paddle.nn.Layer):
...     def __init__(self):
...         super().__init__()
...
...     def forward(self, src_ids, trg_ids, trg_ids_next):
...         return paddle.sum(src_ids), paddle.sum(trg_ids), paddle.sum(trg_ids_next)

>>> wmt16 = WMT16(mode='train', src_dict_size=50, trg_dict_size=50)
>>> model = SimpleNet()

>>> for i in range(10):
...     src_ids, trg_ids, trg_ids_next = wmt16[i]
...     src_ids = paddle.to_tensor(src_ids)
...     trg_ids = paddle.to_tensor(trg_ids)
...     trg_ids_next = paddle.to_tensor(trg_ids_next)
...
...     src_sum, trg_sum, trg_next_sum = model(src_ids, trg_ids, trg_ids_next)
...     print(src_sum.item(), trg_sum.item(), trg_next_sum.item())
89 32 33
79 18 19
55 26 27
147 36 37
106 22 23
135 50 51
54 43 44
217 30 31
146 51 52
55 24 25

get_dict(lang: _Wmt16Language, reverse: Literal[True]) → dict[int, str]

get_dict(lang: _Wmt16Language, reverse: Literal[False] = False) → dict[str, int]

get_dict(lang: _Wmt16Language, reverse: bool = False) → dict[int, str] | dict[str, int]

Return the word dictionary for the specified language.

Parameters

lang (string) – The language whose dictionary is returned. Available options are: “en” for English and “de” for German.
reverse (bool) – If False, the returned Python dictionary uses words as keys and indices as values. If True, it uses indices as keys and words as values.

Returns

The word dictionary for the specified language.

Return type

dict

Examples

>>> from paddle.text.datasets import WMT16
>>> wmt16 = WMT16(mode='train', src_dict_size=50, trg_dict_size=50)
>>> en_dict = wmt16.get_dict('en')
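The two dictionary orientations described for reverse can be sketched without downloading the dataset. The toy vocabulary below is a stand-in, not real dictionary contents, and `decode` is a hypothetical helper; the sketch only shows how a word-to-index mapping and its index-to-word counterpart relate.

```python
# Toy stand-in for a word -> index dictionary (the reverse=False orientation).
# The entries are invented for illustration; real contents depend on the data.
word_to_id = {"<s>": 0, "<e>": 1, "<unk>": 2, "a": 3, "man": 4}

# The reverse=True orientation swaps keys and values: index -> word.
id_to_word = {idx: word for word, idx in word_to_id.items()}


def decode(ids, id_to_word):
    """Turn a sequence of token ids back into a space-separated string."""
    return " ".join(id_to_word[i] for i in ids)


sentence = decode([0, 3, 4, 1], id_to_word)
print(sentence)
```

The index-to-word orientation is the one needed when turning model output ids back into readable text.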