Browsing by Subject "Natural language processing"
Now showing 1 - 3 of 3
Item (Open Access): Abordagem para o desenvolvimento de um etiquetador de alta acurácia para o Português do Brasil (Universidade Federal do Pará, 2011-10-21)
DOMINGUES, Miriam Lúcia Campos Serra; FAVERO, Eloi Luiz; http://lattes.cnpq.br/1497269209026542

Part-of-speech tagging is a basic task required by many applications of natural language processing, such as parsing and machine translation, and by applications of speech processing, for example speech synthesis. The task consists of tagging the words in a sentence with their grammatical categories. Although these applications demand taggers of ever greater precision, state-of-the-art taggers still achieve accuracy of only 96 to 97%. In this thesis, corpus and software resources are investigated for the development of a tagger with accuracy above that of the state of the art for Brazilian Portuguese. Based on a hybrid solution that combines probabilistic tagging with rule-based tagging, the thesis presents an exploratory study of the tagging method and of the size, quality, tag set, and textual genre of the corpora available for training and testing, and evaluates the disambiguation of new (out-of-vocabulary) words found in the texts to be tagged. Four corpora were used in the experiments: CETENFolha, Bosque CF 7.4, Mac-Morpho, and Selva Científica. The proposed tagging model is based on transformation-based learning (TBL), to which three strategies were added, combined in an architecture that integrates the outputs (tagged texts) of two free tools, TreeTagger and µ-TBL, with the modules added to the model. With the tagger trained on the Mac-Morpho corpus, accuracy rates of 98.05% on the Mac-Morpho test set and 98.27% on Bosque CF 7.4 were achieved, both corpora of the journalistic genre. The performance of the proposed hybrid tagger was also evaluated on the texts of the Selva Científica corpus, of the scientific genre. Adjustments needed in the tagger and in the corpora were identified and, as a result, accuracy rates of 98.07% on Selva Científica, 98.06% on the Mac-Morpho test set, and 98.30% on Bosque CF 7.4 were achieved. These results are significant because the accuracy rates exceed the state of the art, validating the proposed model as a way to obtain a more reliable part-of-speech tagger.
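As a rough illustration of the transformation-based learning approach this abstract describes, the sketch below trains NLTK's Brill (TBL) tagger on NLTK's copy of the Mac-Morpho corpus. It is not the thesis's architecture: the unigram baseline, the fntbl37 template set, the rule budget, and the train/test split are all illustrative choices.

```python
# Minimal sketch of TBL part-of-speech tagging on Mac-Morpho using
# NLTK's Brill implementation. Illustrative only: the baseline,
# templates, and rule budget are not the thesis's actual setup.
import nltk
from nltk.corpus import mac_morpho
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

nltk.download("mac_morpho")

sents = list(mac_morpho.tagged_sents())
train, test = sents[:10000], sents[10000:11000]

# Probabilistic-style baseline: most-frequent tag per word, falling
# back to a default tag ("N", noun) for unknown words.
baseline = UnigramTagger(train, backoff=DefaultTagger("N"))

# TBL pass: learn contextual rewrite rules that correct the baseline.
trainer = BrillTaggerTrainer(baseline, fntbl37(), trace=0)
tagger = trainer.train(train, max_rules=200)

print("baseline accuracy:", baseline.accuracy(test))  # NLTK >= 3.6
print("TBL accuracy:", tagger.accuracy(test))
print(tagger.tag("O etiquetador aprende regras de correção .".split()))
```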
Item (Open Access): Algoritmos para seleção de metodologias de avaliação de softwares educacionais (Universidade Federal do Pará, 2023-09-26)
CASTILHO, Janize Monteiro de; FARIAS, Fabricio de Souza; http://lattes.cnpq.br/1521079293982268; https://orcid.org/0000-0003-4344-6953

To support teaching-learning processes, many teachers have decided to use Educational Software (ES) in their classrooms. However, choosing an ES as a teaching resource requires validating the methodology used by the teacher, since the ES needs to be pedagogically and functionally appropriate to the needs and objectives of the classroom; mechanisms are also needed to verify that the ES meets the teacher's objectives. Various techniques and methodologies for ES assessment are available in the literature, but there is still no decision-making solution for selecting an ES assessment methodology that fully addresses user profiles and the different needs each methodology is meant to meet; existing solutions tend to be built for a specific demand and have low capacity for generalization in practical application. As a result, the available solutions lack standardization and often do not take into account criteria relating to quality, measurement scales, and ES verification procedures. This heterogeneity makes the evaluation of an ES very difficult, since subjectivity in the selection of an ES evaluation methodology can produce inconclusive results. Given this context, this work created a quality model covering 24 ES assessment methodologies available in the literature and aims to automate the selection of an ES assessment methodology through artificial intelligence (AI) algorithms, reducing subjectivity in the screening process. The investigation used Natural Language Processing (NLP), Random Forest, k-Nearest Neighbors, and Artificial Neural Networks. In all research scenarios the NLP algorithm was combined with the other algorithms, yielding a solution based on hybrid, loosely coupled AI algorithms, with excellent results: simulations were carried out considering NLP+Random Forest, NLP+k-Nearest Neighbors, and NLP+Artificial Neural Networks. The results indicate that it is possible to determine the best ES assessment methodology using AI algorithms, with the best results obtained with the combination NLP+Random Forest.
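The abstract does not specify how the NLP stage was coupled to the classifiers, but a minimal, loosely coupled NLP+Random Forest pipeline in the spirit it describes might look like the sketch below. The TF-IDF representation, the toy descriptions, and the methodology labels are all assumptions, not the work's actual quality model or its 24 catalogued methodologies.

```python
# Sketch of a loosely coupled NLP + Random Forest pipeline: an NLP
# stage turns a free-text description of the evaluation context into
# features, and a classifier maps it to an ES assessment methodology.
# TF-IDF is an assumption; data and labels below are illustrative only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training pairs: (teacher's needs, suitable methodology).
docs = [
    "usability evaluation with questionnaires for elementary school math",
    "pedagogical adequacy checklist for literacy software",
    "heuristic inspection of interface quality by experts",
    "field observation of student engagement with the software",
]
labels = ["TUP", "Reeves", "Nielsen", "LORI"]  # illustrative names only

pipeline = Pipeline([
    ("nlp", TfidfVectorizer(ngram_range=(1, 2))),                    # NLP stage
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipeline.fit(docs, labels)

query = "expert-based inspection of the user interface"
print(pipeline.predict([query])[0])
```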
Item (Open Access): Aplicação e comparação de técnicas de classificação automática de documentos: um estudo de caso com o dataset do domínio jurídico "Victor" (Universidade Federal do Pará, 2024-02-01)
MARTINS, Victor Simões; SILVA, Cleison Daniel; http://lattes.cnpq.br/1445401605385329; https://orcid.org/0000-0001-8280-2928

The application of Natural Language Processing (NLP) and Artificial Intelligence (AI) in the Brazilian legal context is a rapidly growing area that can change the way legal professionals work, given the volume of text generated. Among the possible applications of NLP and AI is the automatic classification of documents, which, among other uses, can be employed to automate the digitization of judicial proceedings that still exist only in physical form. This work therefore applies and compares AI algorithms for the classification of legal documents. The algorithms are divided into two approaches. The first approach (I) separates the computational representation of the text from the training of the classifier itself, applying SVM and Logistic Regression in conjunction with representations based on TF-IDF, Word2Vec, FastText, and BERT. The second approach (II) performs the computational representation of the documents and the training of the classifier simultaneously, applying Deep Learning algorithms based on recurrent neural networks, specifically ULMFiT (Universal Language Model Fine-tuning) and HAN (Hierarchical Attention Networks). The studied dataset, named VICTOR, is composed of documents from the Supreme Federal Court (STF) of Brazil. The research concludes that both approaches can be applied to the classification of legal documents from this dataset. Moreover, despite being less computationally expensive, the Approach I pipelines that represent documents with TF-IDF yield results equivalent to the pipelines employing Deep Learning. Furthermore, specializing the document embeddings with data from the dataset under study improves the performance of the pipelines that employ Word2Vec, FastText, and ULMFiT, compared to pipelines that use the generic versions of these representations, i.e., models pre-trained on general-domain data.
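As a hedged sketch of Approach I (representation computed separately from classifier training), the snippet below pairs a TF-IDF representation with the two classical classifiers the abstract names, SVM and Logistic Regression. The placeholder documents and class labels stand in for VICTOR data, which is not reproduced here; the vectorizer settings and cross-validation scheme are illustrative choices.

```python
# Sketch of Approach I on a VICTOR-like document classification task:
# step 1 computes the text representation (TF-IDF), step 2 trains
# classical classifiers on it. Placeholder data, not VICTOR documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy corpus standing in for court documents (four document types).
docs = [
    "agravo de instrumento contra decisao que negou seguimento",
    "recurso extraordinario com repercussao geral reconhecida",
    "despacho ordenando a intimacao das partes",
    "despacho determinando a remessa dos autos",
    "sentenca julgando procedente o pedido inicial",
    "sentenca julgando improcedente a acao",
    "acordao negando provimento ao recurso",
    "acordao dando provimento parcial ao recurso",
]
y = ["recurso", "recurso", "despacho", "despacho",
     "sentenca", "sentenca", "acordao", "acordao"]

# Step 1: computational representation, independent of the classifier.
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Step 2: train and compare the two classifiers on that representation.
for name, clf in [("SVM", LinearSVC()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=2, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.2f}")
```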