Talk:Software packages for text analysis and text mining

From KM4Dev Wiki

See the original thread of this E-Discussion on D-Groups: part 1, part 2, part 3

Jaap Pels, 2009/03/19

Dear All,

A request to suggest ready-to-run software packages for text analysis / text mining.

Matt Moore, 2009/03/19


What are you looking to do with it?

This kind of software can get very expensive.

At the other extreme, there is something as simple as Wordle.

The AIIM did a very interesting report on Findability last year.

Wojciech Gryc, 2009/03/19

Hi Jaap and Matt,

Just to follow up, there are a lot of open source and free options as well when it comes to text mining, but it certainly depends on what you're trying to do with it.

For example:

There are other packages that you can use to quickly build text mining prototypes (for example, using Weka).

However, it all depends on your specific goals.

Gabriele Sani, 2009/03/20


I totally agree with Matt: beware, text mining can be extremely expensive, whether the software is commercial or open source. The key point is not the initial price but the TOTAL cost of applying one technology instead of another. If the maintenance and updating of the system proves to be in the order of $100k/year, then you may be better off with a commercial solution...

However, it's always a good idea to start by looking into open source solutions, since they offer some VERY competitive tools. For example, if you are looking for a search engine, look into Apache Solr (built on Lucene), as it offers some remarkable capabilities. If you need a web-based solution, Drupal has some amazing search modules (such as one for the above-mentioned Solr). If you need to automatically categorize documents, your best bet is to look into a Bayesian engine (this could be used to create/update/maintain taxonomies, too). If you want to monitor the readability of your content, look for tools that will give you the Gunning Fog Index of your texts.

Quoting Matt again: what are you looking to do with it?
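To make the readability suggestion concrete, here is a minimal sketch of the Gunning Fog Index, 0.4 × (average words per sentence + percentage of words with three or more syllables). The syllable counter is a crude vowel-group heuristic and the function names are made up for this illustration. The full definition also excludes proper nouns, familiar jargon and common suffixes, which this sketch does not attempt.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count groups of consecutive vowels."""
    lower = word.lower()
    count = len(re.findall(r"[aeiouy]+", lower))
    # Treat a trailing silent "e" as non-syllabic ("make" -> 1),
    # but keep endings like "table" -> 2.
    if lower.endswith("e") and not lower.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def gunning_fog(text: str) -> float:
    """Return 0.4 * (words per sentence + 100 * complex words / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

Calling `gunning_fog(report_text)` on each document and tracking the score over time would give a rough readability trend. Treat the absolute values with caution, given the simplified syllable counting.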

Jaap Pels, 2009/03/20

Hi Matt, Gabriele, Wojciech and others.

The text analysis tool we are looking for is for a project called WASHCost which aims to collect, collate and verify the quality information relating to the real disaggregated costs in the life-cycle of sustainable water, sanitation, and hygiene service delivery to poor people in rural and peri-urban areas.

The assumption is that advocating for these costs to be taken into account, and making this information readily available in a convenient form, will change the policies of authorities, funding agencies and implementing agencies.

Measuring such a change is difficult, and one of the ways we are considering is the analysis of reports, websites, blog sites, scientific papers and other documents produced by authorities, funding agencies, implementing agencies and academic institutions, to see if there is a change in reported attitudes, discourses, procedures, policy content and maybe even behaviour. The reasons we want to automate this analysis are:

  • To have identical approaches in all of the countries we work in, as manual analysis will (even with the best will in the world) not be impartial;
  • The number of documents that will be collected within the project;
  • The potential for other analyses that are possible but that we are not yet aware of.

We assume that this can be done in various ways but the main goal is to study the evolution, the changes over time.

We use a wiki (by SocialText) to keep all information regarding the project, so if the tool can be integrated into the wiki, or can digest its HTML pages, that would be a plus, but at this stage it is not essential.
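On the point about digesting HTML pages: most analysis tools want plain text, so a first step is usually to strip the markup. The following is only a sketch using Python's standard `html.parser` module. The class and function names are invented for this example, and nothing here is specific to SocialText.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The resulting text can then be fed to whatever counting or categorization step is chosen later.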

Does anybody have more suggestions for (free / cheap) software or research in the above respect?

Matt Moore, 2009/03/20


That sounds actually kinda crazy - the technology for even attempting what you want to do isn't mature yet - let alone affordable. You might be better off giving your manual assessors a strict assessment protocol.

I'm still not 100% sure on what your requirements are. Have you written a spec yet?

Gabriele Sani, 2009/03/20

This time I only partially agree with Matt. If you want a completely automated solution, your best bet is to look at some VERY expensive software or services (read: at least several $100k), like Autonomy's IDOL or some buzz monitoring service, with the addition of a review service... and do not expect very solid results.

On the other hand, if you want some statistical data to show the amount of interest/coverage/etc., then you can basically use a word count system and create a list of keywords/phrases that you have previously identified as relevant. For example, a timeline graph of the density of the phrase "water sanitation" can be used as a reasonable measure of the interest in this topic. It's something you are probably already familiar with, as "tag cloud" tools basically do that. But my suggestion is to limit it to a list of keywords to cut the "noise".

Also, once you have a reasonable list of keywords that identify the relevant sources, you can push this approach a little further by adding lists of "negative" and "positive" words. For example, the word "useless" could show a negative attitude to water sanitation, but if you look at its presence alone you cannot be sure whether it relates to water sanitation. One possible approach to this issue is to count the distance between the two terms and adjust the relevance accordingly.
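The two ideas above (phrase density over time, and weighting "negative"/"positive" words by their distance from the phrase) could be sketched roughly as follows. All function names, the per-1000-words scaling, the window size and the linear distance weighting are assumptions made for this illustration, not an established method.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def phrase_positions(tokens, phrase_tokens):
    """Return the start indexes where the phrase occurs in tokens."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase_tokens]


def density_by_period(documents, phrase):
    """documents: iterable of (period, text) pairs.
    Returns {period: occurrences of phrase per 1000 words}."""
    target = tokenize(phrase)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, text in documents:
        tokens = tokenize(text)
        hits[period] += len(phrase_positions(tokens, target))
        totals[period] += len(tokens)
    return {p: 1000 * hits[p] / totals[p] for p in sorted(totals) if totals[p]}


def proximity_score(text, phrase, sentiment_words, window=10):
    """Sum the weights of sentiment words, scaled down linearly with their
    distance (in tokens) from the start of the nearest phrase occurrence.
    Words further away than `window` tokens are ignored."""
    tokens = tokenize(text)
    starts = phrase_positions(tokens, tokenize(phrase))
    score = 0.0
    for i, tok in enumerate(tokens):
        weight = sentiment_words.get(tok)
        if weight is None or not starts:
            continue
        distance = min(abs(i - s) for s in starts)
        if distance <= window:
            score += weight * (1 - distance / (window + 1))
    return score
```

For example, `density_by_period([("2008", text_a), ("2009", text_b)], "water sanitation")` gives one number per period that can be plotted as a timeline, and `proximity_score(text, "water sanitation", {"useless": -1.0})` only counts "useless" when it appears near the phrase.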

A further improvement on the above method would be to use some automatic categorization tools to analyze the contents. Autonomy's technology is one of them, but there are some other Bayesian categorization systems that are available for free. Also, try looking into the World Bank's BuzzMonitor. It sounds good, but I am not entirely sure of how it works, and I have never tried it. One of the reasons I am mentioning it now is that I would love to hear some info on this. ;)
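For readers unfamiliar with Bayesian categorization, here is a toy multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. It is an illustration of the general technique, not a description of how Autonomy or BuzzMonitor work. The class name, the categories and the training texts in the usage note are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs
        self.vocabulary = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the most probable category, or None if untrained."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category in self.doc_counts:
            counts = self.word_counts[category]
            # Add-one (Laplace) smoothing so unseen words do not zero out a category.
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.doc_counts[category] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best
```

In practice you would call `train()` on a few hundred manually categorized documents per category before trusting `classify()` on new ones. With only a handful of examples the results are little better than guessing.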

Probably the easiest/cheapest solution would be to leverage the existing RSS feeds, aggregate them on a web server and then use the now locally collected data to test and develop your analytical tools. A very easy approach would be to install Drupal, throw some RSS feeds into its feed aggregator, and then tweak the tagadelic module to see word counts.
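The same aggregate-then-count idea can be tried without Drupal. The sketch below is a stand-alone Python substitute, not the tagadelic approach itself: it parses RSS 2.0 documents that have already been downloaded and counts the words in item titles and descriptions. The stopword list and function names are assumptions for this example, and real feeds often carry HTML inside descriptions that would need stripping first.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def item_texts(rss_xml):
    """Yield title plus description for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        yield f"{title} {description}"


def word_counts(rss_documents):
    """Count non-stopword tokens across all items of all given feeds."""
    counts = Counter()
    for rss_xml in rss_documents:
        for text in item_texts(rss_xml):
            words = re.findall(r"[a-z']+", text.lower())
            counts.update(w for w in words if w not in STOPWORDS)
    return counts
```

`word_counts(feeds).most_common(20)` then gives the raw material for a tag cloud. Fetching the feeds themselves (for instance with `urllib.request`) is left out so that the example runs offline.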

Clearly, for all of the above, you need to assess the amount of data you want to analyze, and how far you can push your analysis and still have a reasonable trust in your results.

Wojciech Gryc, 2009/03/20

Hi everyone,

Jaap, developing the type of system you're interested in is very difficult. I'm not familiar with tools like IDOL -- I just know the open source equivalents and a few IBM products from a few internships there. However, I will agree that the type of system you want will be expensive, and is unlikely to work as well as most people hope or think it will.

I will second Gabriele's point that developing something with basic statistics (i.e. counting terms or words and doing basic clustering) would be a good idea.
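As a rough idea of what "counting terms and doing basic clustering" can look like, the sketch below turns each document into a term-count vector, compares vectors with cosine similarity, and groups documents with a deliberately simple greedy rule. The greedy rule, the threshold value and the function names are assumptions chosen for brevity. Real projects would more likely use an established algorithm such as k-means from a toolkit like Weka.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map each lower-cased word to the number of times it occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def greedy_clusters(texts, threshold=0.3):
    """Put each text into the first cluster whose seed (first member) is
    similar enough; otherwise start a new cluster. Returns lists of indexes."""
    vectors = [term_vector(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

The result depends on document order and on the threshold, which is a small-scale version of the "results that seem to make no sense" problem: expect to spend time on preprocessing and tuning before the groups look meaningful.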

A few questions you might want to consider:

1. *Do you have staff familiar with computer science / math / text mining? (Or can you hire them?)* Even if you use open source tools to cluster your documents, there are numerous models that you can use to do this. A lot of effort usually goes into preprocessing data or understanding why different types of algorithms have clustered your data the way they have... It's my experience that the first few (or even several dozen) attempts at clustering will give you results that seem to make no sense to you. Automatic clustering is a good idea for a start, but is still quite challenging.

2. *How accurate does the system have to be?* Just to give you an idea, text classification systems that deal with diverse data sets and try to classify them into categories that can be quite subjective, such as "Project Progressing Well" or "People Supporting Project A" will be filled with all sorts of errors. I won't get into the math, but sentiment- or opinion-based classification systems can be quite inaccurate without heavy duty research -- even getting to 70% accuracy for classifying what people are saying would be considered state of the art, in this case.

Moral of the story: don't expect anything very accurate.

3. (As per Gabriele's point...) *Are there more basic pieces of information that would be useful for your work?* An example from the 2008 US presidential election was counting how many times Obama and McCain were mentioned in newspapers, and using this as a proxy for the popularity of each candidate, rather than actually classifying blog posts as "in favour" or "against" the candidates.

For the sake of brevity, I'll stop there... If you're interested in research papers or reports, I'd be happy to recommend some.

On a side note, to those still reading: I'm a student who recently completed his undergrad in mathematics and international development (double major). I often get asked "What the heck do you hope to do with those two?" -- I'm a long-time lurker and recent poster to the KM4Dev list, and I must say this is *exactly* what I was hoping for.

Matt Moore, 2009/03/20

Wojciech & Gabriele,

Lovely answers! Would it be fair to say that there's getting the tool with its theoretical capabilities and then there's making it do what you want? In some of my efforts with these tools (including an Australian product called Leximancer), you can't simply plug 'n' play. A lot of effort is needed in set-up & interpretation - and even then the outputs can give you insight - or they can be junk. Like any form of data mining really (but more so).

Some comments. Simple word counts are a neat idea. One issue Jaap may find is that if he's dealing with sites in multiple languages, comparisons may become difficult. Even if all the sites are in English, you may have issues with poor translation.

Analysis tools like Autonomy tend to work best with simple, homogeneous text sets. Jaap's texts sound very complicated.

That was why I suggested making sure your manual text analysis processes are up to scratch. The approach (at the moment) is not "how can technology replace our manual assessments" but "how can technology augment them".

As Gabriele notes, RSS can update you when new material appears. Delicious can allow you to tag pages collaboratively (and something like NVivo allows you to manually code text collaboratively). The text mining could be deployed cautiously to identify broad keyword patterns in large documents.

The other option is to look at who is producing the information. Does Jaap have any influence with them to encourage them to produce more structured outputs?

Paul Mundy, 2009/03/20

How about a simple, cheap, low-tech solution: get a student of communication studies or development at the local university to do a content analysis for you? The student gets a thesis out of it, and you get your study, all nicely written up. And the thesis will be a lot more useful than most...