[Programming & Development] How should a pre-first-year grad student (研0) get started with NLP?

cdra posted on 2023-10-12 11:53:28
How should a pre-first-year grad student (研0) get started with NLP?
4 replies
guojun_-2007 replied on 2023-10-12 11:54:26
CS224N IS ALL YOU NEED
https://web.stanford.edu/class/cs224n/ (If you seriously work through this course and finish all the assignments, you are already ahead of 90% of the grad students who claim to do NLP.)
lansehai replied on 2023-10-12 11:54:54

First, settle on a research direction

NLP is a broad field, so pick a direction you are interested in or good at. Start by getting a feel for the current state and prospects of each direction; 学术范 (xueshufan) is a useful reference.
Based on statistics across several dimensions, such as publication counts and citation counts, 学术范 breaks natural language processing down into the following research directions. Text mining is currently the hottest of them.

  • Text mining
  • Sentence
  • Machine translation
  • Natural language
  • Parsing
  • Word error rate
  • Syntax
  • Phrase
  • Noun
  • Lexicon
  • Sentiment analysis
  • Information extraction
  • Language model
  • NIST
  • Automatic summarization
  • Rule-based machine translation
  • Realization (linguistics)
  • Question answering
  • Semantic similarity
  • Computational linguistics



Click through the links for more details.

Once you have a basic grasp of these directions, start reading, in a targeted way, the most important papers from recent years.

Here are the most important papers in NLP, as ranked by 学术范's standard evaluation system.
(The linked pages for the papers below offer a translation feature.)
1. Deep contextualized word representations
Authors: Matthew E. Peters / Mark Neumann / Mohit Iyyer / Matt Gardner / Christopher M. Clark / ... / Luke Zettlemoyer
Abstract: We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
Full text: 文献全文 - 学术范 (xueshufan.com)
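The core idea of the abstract is that a task-specific word vector is built from the biLM's internal layer states. Below is a minimal sketch of that combination step only (my own toy code with made-up layer states and weights, not the authors' implementation):

```python
# Toy sketch of the ELMo combination: a softmax-weighted sum of the biLM layer
# states for one token, scaled by a learned scalar gamma. Values are made up.
import numpy as np

def elmo_combine(layer_states, layer_weights, gamma):
    """layer_states: (L, D) hidden states of one token from L biLM layers."""
    s = np.exp(layer_weights) / np.exp(layer_weights).sum()  # softmax-normalised weights
    return gamma * (s[:, None] * layer_states).sum(axis=0)   # weighted sum over layers

layers = np.random.randn(3, 1024)        # e.g. char-CNN layer + 2 biLSTM layers
print(elmo_combine(layers, np.zeros(3), gamma=1.0).shape)    # (1024,)
```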

2. Enriching Word Vectors with Subword Information
Authors: Piotr Bojanowski / Edouard Grave / Armand Joulin / Tomas Mikolov
Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
Full text: 文献全文 - 学术范 (xueshufan.com)
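To make the "bag of character n-grams" idea concrete, here is a toy sketch (my own code, not the released fastText implementation; the random-initialised n-gram table is just a stand-in for trained embeddings):

```python
# Toy sketch of subword word vectors: a word is the sum of the vectors of its
# character n-grams, so a vector can be built even for unseen words.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                       # boundary symbols as in the paper
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

dim = 100
ngram_vectors = {}                        # in practice: trained (hashed) embeddings

def word_vector(word):
    grams = char_ngrams(word)
    for g in grams:                       # toy: random-init any unseen n-gram
        ngram_vectors.setdefault(g, np.random.randn(dim))
    return sum(ngram_vectors[g] for g in grams)

print(char_ngrams("where", 3, 3))         # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_vector("wherever").shape)      # works even for out-of-vocabulary words
```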

3. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Authors: Yonghui Wu / Mike Schuster / Zhifeng Chen / Quoc V. Le / Mohammad Norouzi / ... / Jeffrey Dean
Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
Full text: 文献全文 - 学术范 (xueshufan.com)
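The abstract mentions beam search with length normalization and a coverage penalty. The sketch below shows the general shape of that rescoring; the exact constants and formulas follow my reading of the paper and should be checked against it before use:

```python
# Hedged sketch of GNMT-style beam rescoring: divide the hypothesis log-probability
# by a length penalty, then add a coverage penalty computed from attention weights.
import math

def beam_score(log_prob, hyp_len, attention, alpha=0.6, beta=0.2):
    """attention: attention[i][j] = weight on source word i at target step j."""
    lp = (5 + hyp_len) ** alpha / (5 + 1) ** alpha           # length penalty
    cp = beta * sum(math.log(min(sum(row), 1.0)) for row in attention)
    return log_prob / lp + cp

# toy usage: a 4-word hypothesis over a 3-word source sentence
attn = [[0.5, 0.3, 0.1, 0.1], [0.2, 0.4, 0.3, 0.1], [0.1, 0.2, 0.5, 0.9]]
print(beam_score(-6.2, 4, attn))
```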

4. GloVe: Global Vectors for Word Representation
Authors: Jeffrey Pennington / Richard Socher / Christopher D. Manning
Abstract: Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Full text: 文献全文 - 学术范 (xueshufan.com)
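The "global log-bilinear regression" in the abstract is a weighted least-squares fit over nonzero co-occurrence counts. A small sketch of that objective (my own toy code with random parameters, not the authors' implementation):

```python
# Toy sketch of the GloVe objective: fit w_i·w~_j + b_i + b~_j to log X_ij,
# weighted by f(X_ij), summing only over nonzero co-occurrence entries.
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, cooc, x_max=100.0, alpha=0.75):
    loss = 0.0
    for (i, j), x_ij in cooc.items():                 # only nonzero co-occurrences
        weight = min((x_ij / x_max) ** alpha, 1.0)    # weighting function f(x)
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
        loss += weight * diff ** 2
    return loss

V, d = 5, 8                                           # toy vocabulary and dimension
rng = np.random.default_rng(0)
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_ctx, b, b_ctx, {(0, 1): 12.0, (2, 3): 3.0}))
```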

5. Sequence to Sequence Learning with Neural Networks
Authors: Ilya Sutskever / Oriol Vinyals / Quoc V. Le
Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Full text: 文献全文 - 学术范 (xueshufan.com)
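For readers new to the encoder-decoder idea, here is a minimal PyTorch sketch (my own toy version, not the original implementation; vocabulary sizes and dimensions are arbitrary): one LSTM compresses the source into a fixed-size state, another LSTM decodes the target from it.

```python
# Minimal encoder-decoder sketch in the spirit of the paper. Note the source is
# fed in reversed order, as the abstract reports this helps optimisation.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # reverse the source sentence along the time axis, then encode it
        _, state = self.encoder(self.src_emb(torch.flip(src, dims=[1])))
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)   # teacher forcing
        return self.out(dec_out)                                  # logits per step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (4, 9)), torch.randint(0, 1200, (4, 7)))
print(logits.shape)   # torch.Size([4, 7, 1200])
```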

6. The Stanford CoreNLP Natural Language Processing Toolkit
Authors: Christopher D. Manning / Mihai Surdeanu / John Bauer / Jenny Finkel / Steven J. Bethard / David McClosky
Abstract: We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
Full text: 文献全文 - 学术范 (xueshufan.com)
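As a hedged usage sketch: CoreNLP ships an HTTP server, and one common way to call it from Python is plain HTTP. This assumes you have started the server locally on the default port 9000 and may need adjusting to your CoreNLP version and annotator set.

```python
# Hedged sketch: query a locally running CoreNLP server
# (java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000).
import json
import requests

props = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Stanford University is located in California.".encode("utf-8"),
)
doc = resp.json()
for sentence in doc["sentences"]:
    for tok in sentence["tokens"]:
        print(tok["word"], tok["pos"], tok["ner"])
```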

7. Distributed Representations of Words and Phrases and their Compositionality
Authors: Tomas Mikolov / Ilya Sutskever / Kai Chen / Greg Corrado / Jeffrey Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
Full text: 文献全文 - 学术范 (xueshufan.com)
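The "negative sampling" alternative to the hierarchical softmax is easy to show in a few lines. A toy sketch (my own, not the word2vec code; all vectors are random placeholders):

```python
# Toy sketch of skip-gram with negative sampling: pull the target word's vector
# towards its context word and push it away from k randomly sampled negatives.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_word, v_context, v_negatives):
    pos = np.log(sigmoid(v_context @ v_word))
    neg = sum(np.log(sigmoid(-v_n @ v_word)) for v_n in v_negatives)
    return -(pos + neg)                     # minimise the negative log-likelihood

rng = np.random.default_rng(0)
d = 50
print(neg_sampling_loss(rng.normal(size=d), rng.normal(size=d),
                        [rng.normal(size=d) for _ in range(5)]))
```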

8. Natural Language Processing (Almost) from Scratch
Authors: Ronan Collobert / Jason Weston / Léon Bottou / Michael Karlen / Koray Kavukcuoglu / ... / Pavel P. Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Full text: 文献全文 - 学术范 (xueshufan.com)

9. Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation
Authors: David M. W. Powers
Abstract: Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures a system that performs worse in the objective sense of Informedness, can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance (Informedness), and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
Full text: 文献全文 - 学术范 (xueshufan.com)
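For the binary case, these measures reduce to simple ratios of confusion-matrix counts. A small sketch (my own, under the usual definitions: Informedness = TPR + TNR − 1, Markedness = PPV + NPV − 1):

```python
# Binary-case evaluation measures from raw confusion-matrix counts.
def binary_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    inverse_recall = tn / (tn + fp)         # true negative rate
    inverse_precision = tn / (tn + fn)      # negative predictive value
    informedness = recall + inverse_recall - 1.0
    markedness = precision + inverse_precision - 1.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "informedness": informedness, "markedness": markedness}

print(binary_metrics(tp=40, fp=10, fn=20, tn=30))
```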


Next, you can also visit the natural language processing page on 学术范:

Natural language processing详情-学术范 (xueshufan.com)
学术范 shows NLP's development trend from a data perspective (publications and citations over the years), and provides rankings of top journals, top conferences, and top researchers to help you quickly understand how research in the field is distributed.

Finally, here are five leading figures in NLP:

(Selected by 学术范 based on statistics across several dimensions, such as publication and citation counts.)
1. Christopher D. Manning
Affiliation: Stanford University
For Christopher D. Manning's publications, see: 学者学术成果 - 学术范 (xueshufan.com)

2. Tomas Mikolov
Affiliation: Czech Technical University in Prague
For Tomas Mikolov's publications, see: 学者学术成果 - 学术范 (xueshufan.com)

3. Richard Socher
Affiliation: Salesforce
For Richard Socher's publications, see: 学者学术成果 - 学术范 (xueshufan.com)

4. Ilya Sutskever
Affiliation: OpenAI
For Ilya Sutskever's publications, see: 学者学术成果 - 学术范 (xueshufan.com)

5. Jeffrey Dean
Affiliation: Google
For Jeffrey Dean's publications, see: 学者学术成果 - 学术范 (xueshufan.com)

Hope this helps!
<hr/>学术范: one-stop literature search, reading, annotation, and academic discussion
刘德华摸周杰伦 replied on 2023-10-12 11:55:46
I found my own way through self-study, so I can briefly share my experience and suggestions. They may not suit everyone, so treat them as a reference only.
<hr/>Before being recommended for grad school I had zero background in deep learning, machine learning, or NLP, and I took many detours while feeling my way in on my own.
First, do not spend enormous amounts of time grinding through every detail (mathematical derivations) of traditional machine learning; current NLP research rarely uses it, so focus mainly on understanding the ideas. (If you open my Zhihu articles you can see the many notes on traditional machine-learning derivations I wrote back then. I spent a huge amount of time on them, have forgotten most of it by now, and they have hardly helped.)
Second, there are a huge number of online courses on the market, especially paid ones. Do not buy them! There are plenty of free, high-quality courses.
I have taken many online courses myself; all things considered, the ones I recommend are:
1. 李宏毅机器学习2022 Spring / 李宏毅《机器学习》国语课程(2020)_哔哩哔哩_bilibili
The first link is the latest version on the course's official site and comes with homework, but the videos link to YouTube, which many people may not be able to open. The second is a re-upload on Bilibili; the content is essentially the same.
I recommend watching only the lectures. Professor Hung-yi Lee explains the ideas and details of deep learning very clearly, but the homework is rather heavy, takes a lot of time, and assumes a PyTorch background, so doing it with zero experience is very hard.
Once you finish this course the theory side is covered, but there is still some distance to actual practice.
2. 【完结】动手学深度学习 PyTorch版
These are the deep learning videos Mu Li (李沐) released in 2021, in which everything is explained hands-on, line by line through the code. Follow the course and type out all of the code yourself.
After finishing this part your basic coding ability is covered.
Next it is time to get into NLP. For theory, I recommend going through Stanford CS224n once; I think it is one of the best-taught NLP courses.
3. 【双语字幕】斯坦福CS224n《深度学习自然语言处理》课程(2019) by Chris Manning
As usual, theory should be followed by NLP practice, and here I recommend Xipeng Qiu (邱锡鹏)'s NLP-Beginner exercises.
4. NLP-Beginner:自然语言处理入门练习
Congratulations: once you have worked through all of the above, you have successfully gotten started with NLP!
<hr/>A few more words: once the foundations are in place, how do you go a step further and start NLP research?
First, discuss with your advisor and choose a specific research direction (e.g., machine translation, sentiment analysis, information extraction, and so on).
Next, I recommend finding the most recent survey of your chosen direction and reading it through; that basically tells you the overall development of the area and the differences between the various families of methods.
Then look for papers in your direction at the latest top conferences and read backwards, from newest to oldest, following the references.
A note on how to find papers: it is best to read only top-conference papers, including ACL, EMNLP, NAACL, ICLR, NIPS, ICML (and COLING, EACL); most of the work is concentrated in the first six.
When I enter a new area, I first search GitHub for a paper list someone has already curated. The repo below by @Gordon Lee collects paper lists for most NLP directions, so start looking there.
GitHub - Doragd/Awesome-Paper-List: A curated list of repositories in which many NLP/CV/ML papers and related area resources are collected.
Once you have read the papers on the list, how do you find the newest ones? I recommend two sites:

  • arXiv's CL section (Computation and Language): many papers are posted to arXiv before they are accepted, so it is no exaggeration to say you can see the newest research results there.
  • ACL Anthology: a collection of papers from almost every NLP-related conference and journal.
    https://aclanthology.org/
<hr/>Once you have done all of the above you are ready to publish. So how do you choose which venue to submit to, and what are the deadlines?
I recommend using the site below to pick a conference and check deadlines:
AI Conference Deadlines
爱你的人是我 replied on 2023-10-12 11:55:54
动手学深度学习 (Dive into Deep Learning) is worth having. As for the kind of videos sold by training-class outfits, avoid them if you possibly can.
