[TIP] Word Tokenization

April 24, 2009 Matteo Bertozzi | Filed Under Tips

Sometimes it is useful to split the input text into a list of words for indexing or searching data.
Here is how to extract words from a sentence in C.

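#include <stdio.h>
#include <string.h>

int main(void)
{
    /* strtok() modifies its input, so str must be a writable array. */
    char str[] = "Hi, I'm a test. (This is just a test). "
                 "Join The #qt IRC Channel!"
                 "GNU/Linux - theo@gmail.com";
    char delims[] = " !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    char *result = strtok(str, delims);
    while (result != NULL) {
        printf("%s\n", result);
        /* Passing NULL continues from where the previous call stopped. */
        result = strtok(NULL, delims);
    }
    return 0;
}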
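
Note that strtok() writes NUL bytes into the buffer it scans and keeps hidden state between calls, so it cannot be used on a string literal or on two strings at the same time. Here is a minimal sketch of an alternative that leaves the input untouched, built on the standard strspn() and strcspn() functions. The names next_word and print_words are hypothetical helpers made up for this example, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the next word at or after *cursor without
 * modifying the text. Returns a pointer to the word's first character and
 * stores its length in *len, or returns NULL when no words remain. */
const char *next_word(const char **cursor, const char *delims, size_t *len)
{
    /* Skip any leading delimiters. */
    const char *start = *cursor + strspn(*cursor, delims);
    if (*start == '\0')
        return NULL;
    /* The word is the following run of non-delimiter characters. */
    *len = strcspn(start, delims);
    *cursor = start + *len;
    return start;
}

/* Hypothetical helper: prints one word per line and returns the word count. */
int print_words(const char *text, const char *delims)
{
    const char *cursor = text;
    const char *word;
    size_t len = 0;
    int count = 0;

    while ((word = next_word(&cursor, delims, &len)) != NULL) {
        /* Words are not NUL-terminated here, so print exactly len bytes. */
        printf("%.*s\n", (int)len, word);
        count++;
    }
    return count;
}
```

Calling print_words(str, delims) with the same str and delims as above should print the same words as the strtok() loop, and it also works when str is a const string literal.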


…and this is the Qt way.

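// Requires <QString>, <QRegExp> and <QDebug>.
QString str = "Hi, I'm a test. (This is just a test). "
        "Join The #qt IRC Channel! GNU/Linux - theo@gmail.com";
// Escape the delimiters so they are taken literally, then wrap them in a character class.
QString delim = QRegExp::escape(" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~");
QRegExp regexp(QString("[%1]").arg(delim), Qt::CaseSensitive, QRegExp::RegExp2);
// SkipEmptyParts drops the empty strings produced by adjacent delimiters.
qDebug() << str.split(regexp, QString::SkipEmptyParts);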
