How To Install Sentiment_classifier, Nltk, Numpy, Sentiwordnet In Anaconda Prompt
10/13/2019
New to Python

This article is for software developers, particularly those coming from a Ruby or Java language background, who are facing their first machine learning implementation.

The challenge: Use machine learning to categorize RSS feeds

I was recently given the assignment to create an RSS feed categorization subsystem for a client. The goal was to read dozens or even hundreds of RSS feeds and automatically categorize their many articles into one of dozens of predefined subject areas. The content, navigation, and search functionality of the client website would be driven by the results of this daily automated feed retrieval and categorization.

The client suggested using machine learning, perhaps with Apache Mahout and Hadoop, as she had recently read articles about those technologies. Her development team and ours, however, are fluent in Ruby rather than Java™ technology. This article describes the technical journey, learning process, and ultimate implementation of a solution.

What is machine learning

My first question was, “what exactly is machine learning?” I had heard the term and was vaguely aware that the supercomputer IBM® Watson had recently used it to defeat human competitors in a game of Jeopardy.
As a shopper and social network participant, I was also aware that both Amazon.com and Facebook do an amazingly good job of recommending things (such as products and people) based on data about their shoppers. In short, machine learning lies at the intersection of IT, mathematics, and natural language. It is primarily concerned with these three topics, but the solution for the client would ultimately involve the first two:

Classification. Assigning items to arbitrary predefined categories based on a set of training data of similar items.
Recommendation. Recommending items based on observations of similar items.
Clustering. Identifying subgroups within a population of data.

The Mahout and Ruby detours

Armed with an understanding of what machine learning is, the next step was to determine how to implement it. As the client suggested, Mahout was an appropriate starting place. I downloaded the code from Apache and went about the process of learning machine learning with Mahout and its sibling, Hadoop. Unfortunately, I found that Mahout had a steep learning curve, even for an experienced Java developer, and that working sample code didn’t exist.
Also unfortunate was a lack of Ruby-based frameworks or gems for machine learning.

Finding Python and the NLTK

I continued to search for a solution and kept encountering “Python” in the result sets. As a Rubyist, I knew that Python was a similar object-oriented, text-based, interpreted, and dynamic programming language, though I hadn’t learned the language yet. In spite of these similarities, I had neglected to learn Python over the years, seeing it as a redundant skill set. Python was in my “blind spot,” as I suspect it is for many of my Rubyist peers.

Searching for books on machine learning and digging deeper into their tables of contents revealed that a high percentage of these systems use Python as their implementation language, along with a library known as the Natural Language Toolkit (NLTK).
Further searching revealed that Python was more widely used than I had realized, such as in Google App Engine, YouTube, and websites built with the Django framework. It even comes preinstalled on the Mac OS X workstations I use daily! Furthermore, Python offers interesting and widely used libraries (for example, NumPy and SciPy) for mathematics, science, and engineering. Who knew?

I decided to pursue a Python solution after I found elegant coding examples. A one-liner, for example, is all the code needed to read an RSS feed through HTTP and print its contents.

Getting up to speed on Python

In learning a new programming language, the easy part is often learning the language itself. The harder part is learning its ecosystem: how to install it, add libraries, write code, structure the code files, execute it, debug it, and write unit tests.
This section provides a brief introduction to these topics; be sure to check out the Resources section for links to more information.

Pip

pip is the standard package manager for Python; it installs libraries from the Python Package Index (PyPI). It’s the program you use to add libraries to your system. It’s analogous to gem for Ruby libraries. To add the NLTK library to your system, you run the command pip install nltk.

virtualenv

Most Rubyists are familiar with the issue of system-wide libraries, or gems.
A system-wide set of libraries is generally not desirable, as one of your projects might depend on version 1.0.0 of a given library, while another project depends on version 1.2.7. Likewise, Java developers are aware of this same issue with a system-wide CLASSPATH. Like the Ruby community with its rvm tool, the Python community uses the virtualenv tool (see the Resources section for a link) to create separate execution environments, including specific versions of Python and a set of libraries. The commands in Listing 2 show how to create a virtual environment named p1env for your p1 project, which contains the feedparser, numpy, scipy, and nltk libraries.

Listing 2. Commands to create a virtual environment with virtualenv

The code base structure

After graduating from simple single-file “Hello World” programs, Python developers need to understand how to properly structure their code base regarding directories and file names. Each of the Java and Ruby languages has its own requirements in this regard, and Python is no different. In short, Python uses the concept of packages to group related code and provide unambiguous namespaces. For the purpose of demonstration in this article, the code exists within a given project root directory, such as /p1. Within this directory, there exists a locomotive directory for a Python package of the same name. Listing 3 shows this directory structure.

Listing 3. Example directory structure

The code in Listing 5 also demonstrates a distinguishing feature of Python: All code must be consistently indented or it won’t compile successfully. The tearDown(self) method might look a bit odd at first. You might wonder why the test is hard-coded always to pass? Actually, it’s not. That’s just how you code an empty method in Python.

Tooling

What I really needed was an integrated development environment (IDE) with syntax highlighting, code completion, and breakpoint debugging functionality to help me with the Python learning curve.
As a user of the Eclipse IDE for Java development, I looked next at the pyeclipse plug-in. It works fairly well, though it was sluggish at times. I eventually invested in the PyCharm IDE, which meets all of my IDE requirements.

Armed with a basic knowledge of Python and its ecosystem, it was finally time to start implementing a machine learning solution.
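Before moving on, the two Python conventions noted earlier in this section (significant indentation, and `pass` as the body of an empty method) can be seen in a small unit test. This is only a sketch using the standard unittest module; the `word_count` function and the `WordCountTest` class are hypothetical names invented for illustration and are not part of the locomotive package.

```python
import unittest


def word_count(text):
    """Hypothetical function under test: count whitespace-separated words."""
    return len(text.split())


class WordCountTest(unittest.TestCase):

    def setUp(self):
        # Runs before each test method.
        self.sample = "machine learning with python"

    def tearDown(self):
        # An intentionally empty method: `pass` is a no-op statement,
        # it does not mean "this test passes".
        pass

    def test_word_count(self):
        self.assertEqual(word_count(self.sample), 4)


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes the single test; removing the indentation of any method body would make the file fail to load at all.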
Implementing categorization with Python and NLTK

Implementing the solution involved capturing simulated RSS feeds, scrubbing their text, using a NaiveBayesClassifier, and classifying categories with the kNN algorithm. Each of these actions is described here.

Capturing and parsing the feeds

The project was particularly challenging, because the client had not yet defined the list of target RSS feeds. Thus, there was no “training data,” either. Therefore, the feed and training data had to be simulated during initial development.

The first approach I used to obtain sample feed data was simply to fetch a list of RSS feeds specified in a text file. Python offers a nice RSS feed parsing library called feedparser that abstracts the differences between the various RSS and Atom formats. Another useful library for simple text-based object serialization is humorously called pickle. Both of these libraries are used in the code in Listing 6, which captures each RSS feed as “pickled” object files for later use. As you can see, the Python code is concise and powerful.

Listing 6. The CaptureFeeds class
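As a rough illustration of the approach just described (and not the original CaptureFeeds class), a capture step could look like the sketch below. It assumes the third-party feedparser package is installed when no parser is supplied; the function names, the output file naming, and the choice to keep only each entry's title and summary are assumptions made for this example.

```python
import os
import pickle


def capture_feeds(feed_urls, out_dir, parse=None):
    """Fetch each feed and pickle its entries into out_dir; return the file paths."""
    if parse is None:
        # Imported lazily so the sketch loads even where feedparser is absent.
        import feedparser
        parse = feedparser.parse
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(feed_urls):
        parsed = parse(url)
        # Keep only plain title/summary strings so the pickle stays simple.
        items = [
            {"title": entry.get("title", ""), "summary": entry.get("summary", "")}
            for entry in parsed.get("entries", [])
        ]
        path = os.path.join(out_dir, "feed_%03d.pickle" % index)
        with open(path, "wb") as handle:
            pickle.dump({"url": url, "items": items}, handle)
        paths.append(path)
    return paths


def load_feed(path):
    """Read back one pickled feed file written by capture_feeds."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Passing a custom `parse` callable makes it possible to exercise the capture step with canned data, which fits the situation described here where the real feed list was not yet known.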
The next step was unexpectedly challenging. Now that I had sample feed data, it had to be categorized for use as training data. Training data is the set of data that you give to your categorization algorithm so that it can learn from it.

For example, the sample feeds I used included ESPN, the Sports Network. One of the feed items was about Tim Tebow of the Denver Broncos football team being traded to the New York Jets football team during the same time the Broncos had signed Peyton Manning as their new quarterback.
Another item in the feed results was about the Boeing Company and its new jet. So, the question is, what specific category value should be assigned to the first story? The values tebow, broncos, manning, jets, quarterback, trade, and nfl are all appropriate. But only one value can be specified in the training data as its category. Likewise, in the second story, is the category boeing or jet?
The hard part is in those details. Accurate manual categorization of a large set of training data is essential if your algorithm is to produce accurate results. The time required to do this should not be underestimated.

It soon became apparent that I needed more data to work with, and it had to be categorized already, and accurately. Where would I find such data?
Enter the Python NLTK. In addition to being an outstanding library for language text processing, it even comes with downloadable sets of sample data, or a corpus in their terminology, as well as an application programming interface to easily access this downloaded data. To install the Reuters corpus, use the NLTK downloader. More than 10,000 news articles will be downloaded to your ~/nltk_data/corpora/reuters/ directory. As with RSS feed items, each Reuters news article contains a title and a body, so this NLTK precategorized data is excellent for simulating RSS feeds.
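The download can also be triggered from Python. The following is a minimal sketch, assuming the nltk package is installed and the machine has network access; the two helper function names are hypothetical, while `nltk.download` and the `nltk.corpus.reuters` reader are part of NLTK.

```python
def download_reuters():
    """Fetch the Reuters corpus into the NLTK data directory (needs network)."""
    import nltk  # imported lazily; requires the nltk package to be installed
    return nltk.download("reuters")


def reuters_summary():
    """Return (number of documents, number of categories) for the corpus."""
    from nltk.corpus import reuters
    return len(reuters.fileids()), len(reuters.categories())
```

Calling `download_reuters()` once places the corpus under the NLTK data directory; `reuters_summary()` then reports how many documents and categories are available to stand in for categorized feed items.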
Natural language is messy

The raw input to the RSS feed categorization algorithm is, of course, text written in the English language. Raw, indeed.

English, or any natural language (that is, spoken or ordinary language), is highly irregular and imprecise from a computer processing perspective.
First, there is the matter of case. Is the word Bronco equal to bronco? The answer is maybe. Next, there is punctuation and whitespace to contend with. Is bronco. equal to bronco or bronco,? Then, there are plurals and similar words. Are run, running, and ran equivalent? Well, it depends. These three words have a common stem. What if the natural language terms are embedded within a markup language like HTML? In that case, you have to deal with text like <strong>bronco</strong>. Finally, there is the issue of frequently used but essentially meaningless words like a, and, and the. These so-called stopwords just get in the way. Natural language is messy; it needs to be cleaned up before processing.

Fortunately, Python and NLTK enable you to clean up this mess.
The normalized_words method of class RssItem, in Listing 7, deals with all of these issues. Note in particular how NLTK cleans the raw article text of the embedded HTML markup in just one line of code!
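The cleanup steps listed above (markup, punctuation, case, and stopwords) can be sketched with the standard library alone. This is an illustrative stand-in, not the RssItem code: the function name `normalize_words`, the tiny stopword set, and the regex-based tag stripping (used here instead of an NLTK helper) are all assumptions made for the example.

```python
import re

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is"}

TAG_RE = re.compile(r"<[^>]+>")            # crude HTML tag matcher
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")   # anything but letters, digits, whitespace


def normalize_words(raw_text, stopwords=STOPWORDS):
    """Return lowercase words with markup, punctuation and stopwords removed."""
    text = TAG_RE.sub(" ", raw_text)    # drop embedded markup
    text = text.lower()                 # treat Bronco and bronco alike
    text = NON_WORD_RE.sub(" ", text)   # "bronco," becomes "bronco "
    return [word for word in text.split() if word not in stopwords]
```

A regular expression is a blunt tool for HTML, so a real implementation would more likely use a proper parser; stemming is deliberately left out of this sketch.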
A regular expression is used to remove punctuation, and the individual words are then split and normalized into lowercase.

Listing 7. The RssItem class

NLTK also offers several “stemmer” classes to further normalize the words. Check out the NLTK documentation on stemming, lemmatization, sentence structure, and grammar for more information.

Classification with the Naive Bayes algorithm

The Naive Bayes algorithm is widely used and implemented in the NLTK with the nltk.NaiveBayesClassifier class.
The Bayes algorithm classifies items per the presence or absence of features in their datasets. In the case of the RSS feed items, each feature is a given (cleaned) word of natural language. The algorithm is “naive,” because it assumes that there is no relationship between the features (in this case, words).

The English language, however, contains more than 250,000 words. Certainly, I don’t want to have to create an object containing 250,000 Booleans for each RSS feed item for the purpose of passing to the algorithm.
So, which words do I use? In short, the answer is the most common words in the population of training data that aren’t stopwords.
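A minimal sketch of that selection, using collections.Counter from the standard library rather than any NLTK class. The function names `top_words` and `word_features`, the small stopword set, and the default limit are assumptions for illustration; the feature dictionary is one plausible shape for the per-item Booleans mentioned above.

```python
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is"}


def top_words(all_words, limit=1000, stopwords=STOPWORDS):
    """Return the `limit` most frequent non-stopwords, most common first."""
    counts = Counter(word for word in all_words if word not in stopwords)
    return [word for word, _ in counts.most_common(limit)]


def word_features(article_words, vocabulary):
    """Map each vocabulary word to True/False by its presence in the article."""
    present = set(article_words)
    return {word: (word in present) for word in vocabulary}
```

With a vocabulary of 1000 words, each article is described by 1000 Booleans instead of one per word in the language.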
NLTK provides an outstanding class, nltk.probability.FreqDist, which I can use to identify these top words. In Listing 8, the collect_all_words method returns an array of all the words from all training articles.

This array is then passed to the identify_top_words method to identify the most frequent words. A useful feature of the nltk.FreqDist class is that it’s essentially a hash, but its keys are sorted by their corresponding values, or counts. Thus, it is easy to obtain the top 1000 words with the [:1000] Python syntax.

Listing 8. Using the nltk.FreqDist class

Becoming less naive

As stated earlier, the algorithm assumes that there is no relationship between the individual features.
Thus, phrases like “machine learning” and “learning machine” or “New York Jet” and “jet to New York” are equivalent (to is a stopword). In natural language context, there is an obvious relationship between these words. So, how can I teach the algorithm to become “less naive” and recognize these word relationships?

One technique is to include the common bigrams (groups of two words) and trigrams (groups of three words) in the feature set. It should now come as no surprise that NLTK provides support for this in the form of the nltk.bigrams(...) and nltk.trigrams(...) functions. Just as the top n-number of words were collected from the population of training data words, the top bigrams and trigrams can similarly be identified and used as features.
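To make the idea concrete, here is a small standard-library sketch of n-gram extraction and ranking. The helper names `ngrams` and `top_ngrams` are hypothetical; the tuples produced for n=2 and n=3 are intended to match what NLTK's bigram and trigram functions yield for the same word list.

```python
from collections import Counter


def ngrams(words, n):
    """Return consecutive n-word tuples from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_ngrams(articles, n, limit=1000):
    """Return the `limit` most frequent n-grams across a list of word lists."""
    counts = Counter()
    for words in articles:
        counts.update(ngrams(words, n))
    return [gram for gram, _ in counts.most_common(limit)]
```

Because word order is preserved inside each tuple, ("new", "york") and ("york", "new") count as different features, which is exactly the relationship a bag of single words cannot express.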
Your results will vary

Refining the data and the algorithm is something of an art. Should you further normalize the set of words, perhaps with stemming? Or include more than the top 1000 words?
Or use a larger training data set? Add more stopwords or “stop-grams”? These are all valid questions to ask yourself. Experiment with them, and through trial and error, you will arrive at the best algorithm for your data.
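One way to run such experiments is to hold out part of the labeled data and measure accuracy on it after each change. The sketch below assumes the nltk package is installed for the training step and uses `nltk.NaiveBayesClassifier.train` and `nltk.classify.accuracy`; the helper names, the 25 percent hold-out, and the Boolean feature layout are assumptions for illustration.

```python
import random


def to_features(words, vocabulary):
    """Boolean presence features over the chosen vocabulary."""
    present = set(words)
    return {word: (word in present) for word in vocabulary}


def split_labeled(items, test_fraction=0.25, seed=0):
    """Shuffle (words, category) pairs and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_and_score(items, vocabulary, test_fraction=0.25):
    """Train a Naive Bayes classifier and return (classifier, held-out accuracy)."""
    import nltk  # lazy import: only this function needs NLTK
    train, test = split_labeled(items, test_fraction)
    train_set = [(to_features(words, vocabulary), cat) for words, cat in train]
    test_set = [(to_features(words, vocabulary), cat) for words, cat in test]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, nltk.classify.accuracy(classifier, test_set)
```

Rerunning `train_and_score` with a different vocabulary size, stopword list, or n-gram features gives a comparable accuracy figure for each variation.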
I found that 85 percent was a good rate of successful categorization.

Recommendation with the k-Nearest Neighbors algorithm

The client wanted to display RSS feed items within a selected category or similar categories. Now that the items had been categorized with the Naive Bayes algorithm, the first part of that requirement was satisfied. The harder part was to implement the “or similar categories” requirement. This is where machine learning recommender systems come into play.
Recommender systems recommend an item based on similarity to other items. Amazon.com product recommendations and Facebook friend recommendations are good examples of this functionality.

k-Nearest Neighbors (kNN) is one of the most common recommendation algorithms. The idea is to provide it a set of labels (that is, categories), and a corresponding dataset for each label.
The algorithm then compares the datasets to identify similar items. The dataset is composed of arrays of numeric values, often in a normalized range from 0 to 1. It can then identify the similar labels from the datasets. Unlike Naive Bayes, which produces one result, kNN can produce a ranked list of several (that is, the value of k) recommendations.

I found recommender algorithms simpler to comprehend and implement than the classification algorithms, although the code was too lengthy and mathematically complex to include here. Refer to the excellent new Manning book, Machine Learning in Action, for kNN coding examples (see the Resources section for a link). In the case of the RSS feed item implementation, the label values were the item categories, and the dataset was an array of values for each of the top 1000 words.
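The core of the idea fits in a few lines. The following is a deliberately simplified sketch and not the implementation described above: it assumes Euclidean distance over equal-length numeric vectors, and the function names and the vote-based ranking are choices made for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, labeled_vectors, k):
    """Return the k (distance, label) pairs closest to query, nearest first."""
    scored = [(euclidean(query, vector), label) for label, vector in labeled_vectors]
    return sorted(scored)[:k]


def rank_labels(query, labeled_vectors, k=3):
    """Rank labels by votes among the k nearest neighbours (ties: nearer first)."""
    neighbours = k_nearest(query, labeled_vectors, k)
    votes = Counter(label for _, label in neighbours)
    return [label for label, _ in votes.most_common()]
```

Here `labeled_vectors` is a list of (label, vector) pairs, for example one vector of word values per categorized item; the returned list is the ranked set of "similar categories" for the query vector.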
Again, constructing this array is part science, part math, and part art. The values for each word in the array can be simple zero-or-one Booleans, percentages of word occurrences within the article, an exponential value of this percentage, or some other value.

Conclusion

Discovering Python, NLTK, and machine learning has been an interesting and enjoyable experience. The Python language is powerful and concise and now a core part of my developer toolkit. It is well suited to machine learning, natural language, and mathematical/scientific applications.
Although not mentioned in this article, I also found it useful for charting and plotting. If Python has similarly been in your blind spot, I encourage you to take a look at it.

Resources

Learn more about machine learning from Wikipedia.
Check out Python's official website.
Read Peter Harrington's Machine Learning in Action (Manning, 2012).
Check out Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper (O'Reilly, 2009).
Check out Implement Bayesian inference using PHP (Paul Meagher, developerWorks, March-May 2009).
A twitter sentiment classifier based on Support Vector Machines and K nearest neighbors algorithms

Overall description

As the title suggests, this repository contains source code (src folder), datasets (data folder) and useful resources for twitter sentiment analysis (resources folder).

The training dataset is split into 3 files containing a processed version of tweets in the three classes: positive (data/used/positive1.csv), negative (data/used/negative1.csv) and neutral (data/used/neutral1.csv).

The training dataset is collected from the SemEval challenge, STS gold and the Sanders dataset.