This post is part of the Deep Learning with PyTorch course.
In this post we will look at PyTorch-NLP, an open-source library for natural language processing built on PyTorch, which ships with handy modules for datasets, pretrained embeddings, text encoders, neural networks, and more.
The torchnlp.datasets package provides modules to download, cache, and load datasets for natural language processing. These modules return torch.utils.data.Dataset objects, like the ones we saw in the post about those classes. These objects have methods for indexing and selecting elements, and they can be passed to a DataLoader for the loading and training process.
Next, we install the pytorch-nlp library and explore the IMDB dataset, which contains movie reviews with their corresponding sentiment label.
pip install pytorch-nlp
Collecting pytorch-nlp
  Downloading https://files.pythonhosted.org/packages/4f/51/f0ee1efb75f7cc2e3065c5da1363d6be2eec79691b2821594f3f2329528c/pytorch_nlp-0.5.0-py3-none-any.whl (90kB)
     |████████████████████████████████| 92kB 4.1MB/s
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pytorch-nlp) (1.18.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from pytorch-nlp) (4.38.0)
Installing collected packages: pytorch-nlp
Successfully installed pytorch-nlp-0.5.0
import torch
from torch.utils.data import Dataset
from torchnlp.datasets import imdb_dataset

# Create a dataset object with the training data
dataset1 = imdb_dataset(train=True)
# The training split contains 25000 examples
print(len(dataset1))

# Show the first three examples
dataset1[0:3]
25000
[{'sentiment': 'pos', 'text': 'Elephant Walk (1954) Starring an early Peter Finch as lord of the manor in some God-forsaken plantation where there is always the danger of elephants or mad Englishmen, staying out in the midday sun and going berserk. Well eventually they do, after the typhoid or cholera outbreak, of course, and much mayhem ensues. Taylor replaced an ailing Vivien Leigh in this pot boiler/adventure flick. When the elephants storm the house and trap Liz on the grand staircase I still get goose bumps. Thank goodness Dana Andrews is around to save the day. One of my favorite guilty pleasures. In color too!'}, {'sentiment': 'pos', 'text': "Another reason to watch this delightful movie is Florence Rice. Florence who? That was my first reaction as the opening credits ran on the screen. I soon found out who Florence Rice was, A real beauty who turns in a simply wonderful performance. As they all do in this gripping ensemble piece. From 1939, its a different time but therein lies the charm. It transports you into another world. It starts out as a light comedy but then turns very serious. Florence Rice runs the gamut from comedienne to heroine. She is absolutely delightful, at the same time strong, vulnerable evolving from a girl to a woman.Watch her facial expressions at the end of the movie. She made over forty movies, and I am going to seek out the other thirty nine. Alan Marshal is of the Flynn/Gable mode and proves a perfect match for Florence. Buddy Ebsen and Una Merkel provide some excellent comic moments, but the real star is Florence Rice. Fans of 30's/40's movies, Don't miss this one!"}, {'sentiment': 'pos', 'text': 'A young woman who is a successful model, and is also engaged to be married, and who has twice attempted suicide in the past, is chosen by a secretive and distant association of Catholic priests to be the next "sentinel" to the gateway to Hell, which apparently goes through a creepy old, but well maintained Brooklyn apartment building. Its tenants take the stairway up and can reincarnate themselves, but apparently can\'t escape as long as a sentinel is there to block the way. The previous one(John Carradine) is about dead, so she, by fate or whatever, becomes the next one, and the doomed must get her to kill herself in order for them to be free. Lots of interesting details lie under the surface, her relationship with her father, the stories of the doomed, her fiancé, so one can pass this off as cheap exploitation horror, but given the sets, the great cast, and overall level of bizarreness, this is definitely worth seeing.'}]
# Check the sentiment label of the first example
dataset1[0]['sentiment']
'pos'
Next, we extract the dictionary values into two lists, one for the texts and another for the labels.
# Gather the text and the sentiment label of each example into two lists
leng = 25000
text = []
label = []
for i in range(leng):
    text.append(dataset1[i]['text'])
    label.append(dataset1[i]['sentiment'])
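As noted above, these dataset objects can also be passed directly to a DataLoader. The following is a minimal sketch (not part of the original code) of how the IMDB examples could be batched; the batch_size value and the collate_batch helper are illustrative choices, not part of the library.

from torch.utils.data import DataLoader

# Each IMDB item is a dict with 'text' and 'sentiment' keys, so we
# gather them into two lists per batch (illustrative collate function)
def collate_batch(batch):
    texts = [item['text'] for item in batch]
    sentiments = [item['sentiment'] for item in batch]
    return texts, sentiments

loader = DataLoader(dataset1, batch_size=32, shuffle=True, collate_fn=collate_batch)
texts, sentiments = next(iter(loader))
print(len(texts))  # 32 texts per batch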
The torchnlp.encoders package contains classes for encoding text into tensors and vice versa. Below we look at the WhitespaceEncoder class, which encodes a text by splitting it on whitespace. We import the package, define the text needed to build the vocabulary, and create a WhitespaceEncoder object, passing it that text.
Then we print the vocabulary size and the list of tokens in the vocabulary, and finally we encode a sample text using the object we just created.
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["Esto es un ejemplo de bloque de texto", "Se va a utilizar para tokenizar"]
encoder = WhitespaceEncoder(loaded_data)
print(encoder.vocab_size)
encoder.vocab
18
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'Esto', 'es', 'un', 'ejemplo', 'de', 'bloque', 'texto', 'Se', 'va', 'a', 'utilizar', 'para', 'tokenizar']
encoder.encode("utilizar un ejemplo")
tensor([15, 7, 8])
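The encoder can also go the other way. Assuming the decode method of WhitespaceEncoder behaves as documented for pytorch-nlp 0.5.0, a quick sketch to recover the original string from the tensor would be:

# Turn the tensor of indices back into a whitespace-joined string
decoded = encoder.decode(encoder.encode("utilizar un ejemplo"))
print(decoded)  # we would expect 'utilizar un ejemplo'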
The torchnlp.word_to_vector package contains several pretrained embeddings. Next we download the GloVe model, load the embeddings of the words "Hello", "Hi", "Welcome" and "Car", and measure the difference between those vectors using the mean squared error. We can see that the word closest to "Hello" is "Hi" and the farthest one is "Car".
from torchnlp.word_to_vector import GloVe

embedd = GloVe()
100%|██████████| 2196017/2196017 [05:50<00:00, 6269.42it/s]
import torch.nn as nn

# Embeddings of the four words
n1 = embedd['Hello']
n2 = embedd['Hi']
n3 = embedd['Welcome']
n4 = embedd['Car']

# Mean squared error between pairs of embeddings
loss = nn.MSELoss()
dev1 = loss(n1, n2)
print(dev1)
dev2 = loss(n1, n3)
print(dev2)
dev3 = loss(n1, n4)
print(dev3)
tensor(0.0457)
tensor(0.0792)
tensor(0.2429)
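The MSE values above are just one way of comparing embeddings. As an alternative sketch (not in the original code), we could use torch.nn.functional.cosine_similarity, where higher values mean more similar vectors; we would expect the same ordering, with "Hi" closest to "Hello" and "Car" farthest.

import torch.nn.functional as F

# Cosine similarity between 1-D embedding vectors (computed along dim=0)
print(F.cosine_similarity(n1, n2, dim=0))  # 'Hello' vs 'Hi'
print(F.cosine_similarity(n1, n3, dim=0))  # 'Hello' vs 'Welcome'
print(F.cosine_similarity(n1, n4, dim=0))  # 'Hello' vs 'Car'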