[tug-summer-of-code] A couple of project proposals
Daniel Kirsch
2009-06-29 08:22:17 UTC
2) LaTeX handwriting-based symbol search
There are occasional posts to comp.text.tex asking by
description for a particular LaTeX symbol ("a capital U with a
dot...no, not with the dot above the U but centered within
it...no, not such a small dot, a bigger dot..."). It'd be a
great help to produce a Web site in which a user can draw a
symbol with the mouse and have the site return a list of LaTeX
symbols (glyph + package + control sequence) that best match the
user's drawing.
Hi,
I think I did something like that:

http://detexify.kirelabs.org

A friend of mine had this exact same idea and I just began it out of
interest. Actually I am a TeX beginner myself and my solution is far
from finished and leaves a lot to be desired. There are not too many
symbols trained as of now either, but it works reasonably well on Greek
letters. I'd be glad to hear some feedback.
Daniel Kirsch
Joseph Wright
2009-06-29 14:41:34 UTC
Post by Daniel Kirsch
2) LaTeX handwriting-based symbol search
There are occasional posts to comp.text.tex asking by
description for a particular LaTeX symbol ("a capital U with a
dot...no, not with the dot above the U but centered within
it...no, not such a small dot, a bigger dot..."). It'd be a
great help to produce a Web site in which a user can draw a
symbol with the mouse and have the site return a list of LaTeX
symbols (glyph + package + control sequence) that best match the
user's drawing.
Hi,
http://detexify.kirelabs.org
A friend of mine had this exact same idea and I just began it out of
interest. Actually I am a TeX beginner myself and my solution is far
from finished and leaves a lot to be desired. There are not too many
symbols trained as of now either, but it works reasonably well on Greek
letters. I'd be glad to hear some feedback.
Daniel Kirsch
Hello Daniel,

Very nice indeed! A bit of publicity to get some training done, and
this could be a very useful tool indeed.
--
Joseph Wright
Scott Pakin
2009-06-29 19:45:34 UTC
Daniel,
Post by Daniel Kirsch
2) LaTeX handwriting-based symbol search
There are occasional posts to comp.text.tex asking by
description for a particular LaTeX symbol ("a capital U with a
dot...no, not with the dot above the U but centered within
it...no, not such a small dot, a bigger dot..."). It'd be a
great help to produce a Web site in which a user can draw a
symbol with the mouse and have the site return a list of LaTeX
symbols (glyph + package + control sequence) that best match the
user's drawing.
Hi,
http://detexify.kirelabs.org
A friend of mine had this exact same idea and I just began it out of
interest. Actually I am a TeX beginner myself and my solution is far
from finished and leaves a lot to be desired. There are not too many
symbols trained as of now either, but it works reasonably well on Greek
letters.
Very cool! That's just what I was thinking of when I proposed
handwriting-based symbol search for Google Summer of Code. If you
don't mind my asking, about how long did it take you to write that?
(I don't know if you read the rest of this thread, but we had some
discussion about whether this project would take far more than a
summer to complete, and your answer can certainly help us calibrate
project scope for next year.)
Post by Daniel Kirsch
I'd be glad to hear some feedback.
I trained a few symbols on your site and noticed that many of the
symbols are accented letters, just because there are so many of them:
\acute{a}, \acute{b}, \acute{c}, ..., \acute{z}, \hat{a}, \hat{b},
..., \hat{z}, etc. Maybe you could bias the training requests towards
some of the more obscure or hard-to-name symbols?

Have you already trained the program on the typeset versions of the
symbols, or do you require handwritten input?


Once again, good job! I hope you manage to get the program trained on
lots of symbols in the near future. If you can think of some code
enhancements that might be suitable for a GSoC project next summer,
please post them to this list; I'm sure that'll spark some good
discussion.

-- Scott
Daniel Kirsch
2009-06-30 14:10:58 UTC
Very cool! That's just what I was thinking of when I proposed
handwriting-based symbol search for Google Summer of Code. If you
don't mind my asking, about how long did it take you to write that?
(I don't know if you read the rest of this thread, but we had some
discussion about whether this project would take far more than a
summer to complete, and your answer can certainly help us calibrate
project scope for next year.)
I am sorry to say that I have absolutely no idea how much time the
whole thing took. I have to stress that I was both new to pattern
recognition and inexperienced in LaTeX when I started detexify (first
commit on May 24). So most of the time was spent getting into the topic,
and I am still not sure if I chose the best
technologies/algorithms/features.
I trained a few symbols on your site and noticed that many of the
symbols are accented letters, just because there are so many of them:
\acute{a}, \acute{b}, \acute{c}, ..., \acute{z}, \hat{a}, \hat{b},
..., \hat{z}, etc. Maybe you could bias the training requests towards
some of the more obscure or hard-to-name symbols?
I have now integrated the training into the searching, which makes more
sense anyway. The old training is still available and will always
offer the symbols with the fewest samples to be trained.
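For illustration only, here is a minimal sketch of that selection rule in
Python. It is not the Ruby code detexify actually runs; the function name
and the sample counts are made up.

```python
def next_symbol_to_train(sample_counts):
    """Return the symbol with the fewest training samples.

    Ties are broken alphabetically so the result is deterministic.
    """
    return min(sorted(sample_counts), key=lambda sym: sample_counts[sym])


# Hypothetical sample counts per symbol.
counts = {r"\alpha": 12, r"\aleph": 3, r"\oint": 7}
print(next_symbol_to_train(counts))
```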
Have you already trained the program on the typeset versions of the
symbols, or do you require handwritten input?
Everything is based on handwritten input. That was a performance
decision. I have experimented with analysis of image data and found
it to be too slow (in Ruby with RMagick at least).
Once again, good job! I hope you manage to get the program trained on
lots of symbols in the near future.
I would really like to support the Comprehensive List of LaTeX Symbols
but as already noted I am not very experienced with LaTeX. The system
should work with any kind of hand-drawn symbol but right now my
problem is that I don't know how to get all these symbols rendered for
the web. I am using MathTeX (http://www.forkosh.com/mathtex.html) to
render the symbols.

Daniel
Mojca Miklavec
2009-06-30 18:04:57 UTC
Post by Daniel Kirsch
Post by Scott Pakin
I trained a few symbols on your site and noticed that many of the
symbols are accented letters, just because there are so many of them:
\acute{a}, \acute{b}, \acute{c}, ..., \acute{z}, \hat{a}, \hat{b},
..., \hat{z}, etc. Maybe you could bias the training requests towards
some of the more obscure or hard-to-name symbols?
I have now integrated the training into the searching, which makes more
sense anyway.
Thanks. That makes much more sense.
Post by Daniel Kirsch
The old training is still available and will always
offer the symbols with the least samples to be trained.
It would make sense to ask the user if he wants to find:
a) a standalone symbol (\alpha, \times, \int, ...)
b) or an accent

It makes no sense to treat a caron or macron or acute accent ... on
each and every letter separately.

In case a) you may offer an empty box and in case b) you draw an
imaginary symbol (could be a gray circle) in the background, so that
the user can place an accent on it. The user is probably only interested
in knowing the command for placing the accent. I don't think that you
really need to recognize the letter along with the accent. It makes the
task more difficult and it leaves you with many more choices at the end.

Even if you cannot recognize the symbol properly, you can still
display "caron", "circumflex", "tilde" and "macron" on the same page
and the approximate information will still be useful. If you need to
recognize the letter along with it, not only will the recognition be
less accurate, but say that you would recognize the letter as being
"o", "O", "a", "q" or "Q" -> you would be left with 20 possible
outcomes, which would make it difficult to display all the possible
choices.
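
To spell out the arithmetic, here is a small Python sketch. The four accent
commands are the math-mode ones for caron, circumflex, tilde and macron; the
letter list is just the example above.

```python
from itertools import product

# Math-mode accent commands: caron, circumflex, tilde, macron.
accents = [r"\check", r"\hat", r"\tilde", r"\bar"]
# Plausible guesses for the base letter under the accent.
letters = ["o", "O", "a", "q", "Q"]

# Recognizing accent and letter together: every pairing is a candidate.
combined = [f"{acc}{{{ch}}}" for acc, ch in product(accents, letters)]

print(len(combined))  # 20 candidates when the letter is recognized too
print(len(accents))   # 4 candidates when only the accent is recognized
```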
Post by Daniel Kirsch
Post by Scott Pakin
Have you already trained the program on the typeset versions of the
symbols, or do you require handwritten input?
Everything is based on handwritten input. That was a performance
decision. I have experimented with analysis of image data and found
it to be too slow (in Ruby with RMagick at least).
I don't know what technology you use in the background, so I don't
understand what exactly makes the process slow. You should be able to
extract at least the "main strokes/features" (something like a graph
showing where lines are connected or disconnected) from typeset data.
Post by Daniel Kirsch
Post by Scott Pakin
Once again, good job! I hope you manage to get the program trained on
lots of symbols in the near future.
I would really like to support the Comprehensive List of LaTeX Symbols
but as already noted I am not very experienced with LaTeX. The system
should work with any kind of hand-drawn symbol but right now my
problem is that I don't know how to get all these symbols rendered for
the web. I am using MathTeX (http://www.forkosh.com/mathtex.html) to
render the symbols.
You should not worry. This step is easier than any other you have made
so far. Since the number of symbols is limited, you could simply
pre-generate them all (for example, create DVI or PDF and then convert
to PNG, using dvipng for DVI or ghostscript for PDF).
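
A rough sketch of such a pre-generation step in Python, assuming the latex
and dvipng programs are installed. The template, the file names and the 150
dpi resolution are arbitrary choices for illustration; the script only
builds the sources and command lines, it does not run them.

```python
# Minimal one-symbol document; the symbol is typeset in math mode here.
TEMPLATE = r"""\documentclass{article}
\pagestyle{empty}
\begin{document}
$%s$
\end{document}
"""


def symbol_source(command):
    """Return the LaTeX source of a page containing just one symbol."""
    return TEMPLATE % command


def build_commands(name):
    """Return the two command lines that turn name.tex into name.png."""
    return [
        ["latex", "-interaction=nonstopmode", f"{name}.tex"],
        # -T tight crops to the glyph, -D sets the resolution in dpi.
        ["dvipng", "-T", "tight", "-D", "150", "-o", f"{name}.png", f"{name}.dvi"],
    ]


print(symbol_source(r"\alpha"))
for cmd in build_commands("alpha"):
    print(" ".join(cmd))
```

Each command list could then be handed to subprocess.run once the .tex file
has been written to disk.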

Mojca
Daniel Kirsch
2009-07-04 08:28:14 UTC
Post by Mojca Miklavec
a) a standalone symbol (\alpha, \times, \int, ...)
b) or an accent
It makes no sense to treat a caron or macron or acute accent ... on
each and every letter separately.
In case a) you may offer an empty box and in case b) you draw an
imaginary symbol (could be a gray circle) in the background, so that
the user can place an accent on it. The user is probably only interested
in knowing the command for placing the accent. I don't think that you
really need to recognize the letter along with the accent. It makes the
task more difficult and it leaves you with many more choices at the end.
This is another very good suggestion. I will separate symbol and
accent search and I have already removed the accented characters from
the current version as that cluttered the whole thing too much.
Post by Mojca Miklavec
Post by Daniel Kirsch
I would really like to support the Comprehensive List of LaTeX Symbols
but as already noted I am not very experienced with LaTeX. The system
should work with any kind of hand-drawn symbol but right now my
problem is that I don't know how to get all these symbols rendered for
the web. I am using MathTeX (http://www.forkosh.com/mathtex.html) to
render the symbols.
You should not worry. This step is easier than any other you have made
so far. Since the number of symbols is limited, you could simply
pre-generate them all (for example, create DVI or PDF and then convert
to PNG, using dvipng for DVI or ghostscript for PDF).
I am working on a solution now and I think I have almost understood
what is needed. From what I have seen so far for a particular symbol I
need to know

package,
\command,
fontenc
and if it is made for math mode.

The last point is something I have problems with. How do I know which
symbols work in math mode and which don't? My understanding so far is
that there are some LaTeX2e and AMS symbols that work in both math and
text mode and the rest of the symbols are either math or text mode.
Math mode symbols are the ones from section 3 of the comprehensive
symbols list and all the others are text mode.
What bothers me most is that non-math-mode symbols will get rendered
without an error in math mode but the outcome is not as desired (so I
can't find out programmatically).

http://github.com/kirel/comprehensive-latex-yaml/blob/0c691d0c77606275d697a0be422394c370fe35d5/symbols.yaml
is a first idea to organize the symbols so that I can generate all the
graphics with a script. That also allows me to add necessary
information to recognized symbols such as the package needed. Am I
missing something important?
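
To make the four fields concrete, here is a rough sketch in Python of how a
script could turn one record into a standalone document. The field names and
the example entries are illustrative and not necessarily those used in the
linked symbols.yaml.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Symbol:
    command: str                   # e.g. \alpha
    package: Optional[str] = None  # package to load, if any
    fontenc: Optional[str] = None  # font encoding, e.g. T1
    mathmode: bool = False         # typeset inside $...$ ?


def to_document(sym):
    """Build a one-symbol LaTeX document from a Symbol record."""
    lines = [r"\documentclass{article}"]
    if sym.fontenc:
        lines.append(r"\usepackage[%s]{fontenc}" % sym.fontenc)
    if sym.package:
        lines.append(r"\usepackage{%s}" % sym.package)
    lines += [r"\pagestyle{empty}", r"\begin{document}"]
    lines.append(f"${sym.command}$" if sym.mathmode else sym.command)
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


examples = [
    Symbol(r"\alpha", mathmode=True),
    Symbol(r"\leadsto", package="amssymb", mathmode=True),
    Symbol(r"\guillemotleft", fontenc="T1"),
]
for sym in examples:
    print(to_document(sym))
```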

Daniel
Scott Pakin
2009-07-04 18:41:57 UTC
Daniel,
Post by Daniel Kirsch
I am working on a solution now and I think I have almost understood
what is needed. From what I have seen so far for a particular symbol I
need to know
package,
\command,
fontenc
and if it is made for math mode.
How about the number and type of arguments \command takes?
Post by Daniel Kirsch
The last point is something I have problems with. How do I know which
symbols work in math mode and which don't? My understanding so far is
that there are some LaTeX2e and AMS symbols that work in both math and
text mode and the rest of the symbols are either math or text mode.
Math mode symbols are the ones from section 3 of the comprehensive
symbols list and all the others are text mode.
What bothers me most is that non-math-mode symbols will get rendered
without an error in math mode but the outcome is not as desired (so I
can't find out programmatically).
I don't believe it's possible in the general case to determine
automatically if a given macro produces a text-mode or math-mode
symbol (or either). Going by "Section 3 = math; non-Section 3 = text"
in the CLSL might not be such a bad way to start. Another possibility
is to look at how each symbol is defined in its .sty file.
\DeclareMathSymbol, \DeclareMathDelimiter, \DeclareMathAccent, and
\DeclareMathRadical, and even the primitive \mathchar give you plenty
of information. Text symbols exhibit more variety in definition,
unfortunately, but you may be able to assume that if a symbol isn't
clearly math, then it's probably text. Yet another possibility is to
trace the use of a symbol. Build a little LaTeX document (using
-interaction=nonstopmode, just in case) that loads a package and
places a symbol, maybe somewhat like this:

\documentclass{article}
\usepackage{<SOME PACKAGE>}
\begin{document}

\[
\tracingmacros=2
\typeout{BEGIN SYMBOL TEST}
<SOME SYMBOL>
\typeout{END SYMBOL TEST}
\tracingmacros=0
\]

\end{document}

Then search the log file for the macro expansions between the "BEGIN
SYMBOL TEST" and "END SYMBOL TEST" lines. If "math" occurs, then
<SOME SYMBOL> is likely to be a math symbol; if not, it's likely to be
a text symbol.
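
In Python, for example, the log check might look roughly like this. It is
only a sketch: the function name is arbitrary and the two log excerpts are
invented for illustration, so real logs will contain more noise.

```python
def is_probably_math(log_text,
                     begin="BEGIN SYMBOL TEST",
                     end="END SYMBOL TEST"):
    """Guess whether the traced symbol is a math symbol.

    Looks for the substring "math" in the part of the log that lies
    between the first begin marker and the next end marker.
    """
    start = log_text.find(begin)
    if start == -1:
        raise ValueError("begin marker not found in log")
    stop = log_text.find(end, start + len(begin))
    if stop == -1:
        raise ValueError("end marker not found in log")
    return "math" in log_text[start + len(begin):stop]


# Invented log excerpts, not real LaTeX output.
math_log = "BEGIN SYMBOL TEST\n\\foo ->\\mathrel {\\bar }\nEND SYMBOL TEST\n"
text_log = "BEGIN SYMBOL TEST\n\\baz ->\\textbullet \nEND SYMBOL TEST\n"

print(is_probably_math(math_log))
print(is_probably_math(text_log))
```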

Whatever approach you end up taking will certainly require some manual
effort to verify the results.
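
As for the idea of looking at the .sty definitions, a crude scan could be
sketched in Python as follows. The regular expression only catches the four
\DeclareMath... commands named above, and the sample package text is made up
rather than copied from a real package.

```python
import re

# Matches e.g. \DeclareMathSymbol{\foo} or \DeclareMathSymbol\foo and
# captures the declaration type and the control sequence being declared.
DECL = re.compile(
    r"\\(DeclareMath(?:Symbol|Delimiter|Accent|Radical))\s*\{?\s*(\\[A-Za-z@]+)"
)


def math_declarations(sty_text):
    """Map each declared control sequence to its declaration command."""
    return {m.group(2): m.group(1) for m in DECL.finditer(sty_text)}


# Hypothetical package contents.
sample = r"""
\DeclareMathSymbol{\mysym}{\mathrel}{symbols}{"20}
\DeclareMathAccent{\myaccent}{\mathord}{symbols}{"14}
\newcommand{\mytextsym}{\textbullet}
"""

for name, kind in math_declarations(sample).items():
    print(name, kind)
```

Anything not found this way would fall into the "probably text" bucket and
need checking by hand.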

-- Scott

P.S. Another useful set of symbols to include would be a
blackboard-bold font (especially the uppercase letters and the digit
"1"). Most people resort to creative ways to describe what they're
looking for because they don't know the term "blackboard bold".
