ACTG: Help and Frequently Asked Questions (FAQs)
HELP
Here we present a guide for the use and interpretation of data generated from the ACTG application.
Mapping of tags.
ACTG tag mapping: a step by step explanation.
Some technical information about ACTG.
Rules for selecting the best database for mapping your tag list and rules for ranking the tag to gene mapping.
How are the databases in ACTG generated?
What are the influences of redundant cDNAs in the virtual tag databases?
Output files.
ACTG output data nomenclature.
FAQ
Here we present general questions about using the ACTG application and interpreting its data. If you do not find an answer to your question, please send us an e-mail via our comments page.
What is a virtual tag?
What should I do if my favorite tag is mapped to different cDNAs or different UniGene clusters?
What is the non-redundant ACTG tag list? How does ACTG create this dataset?
What is the difference between SAGE and MPSS?
What exactly is SBS?
What is the difference between MPSS and SBS?
What data can ACTG map?
What should be the input data format?
No tag was mapped, what is the problem?
Can I map a list of tags with two or more columns, where each column contains the tag frequency from one library?
Why does ACTG generate multiple output files? What does each output file contain?
Is the output file comma or tab delimited?
Are all output files in the same format?
What do the symbols '*', '%' or '#' mean?
What is the putative artifactual tags list?
Is ACTG freely available?
What do I do if I want to publish a work using ACTG?
How often will the ACTG dataset be updated?
HELP
Mapping of tags.
The mapping of tags is an automated process carried out by a set of programs in the ACTG tool (for details, see Some technical information about ACTG). ACTG maps short SAGE (10 bp), long SAGE (17 bp), short MPSS (13 bp), long MPSS (16 bp) and SBS (16 bp) tags for both human and mouse. The user's tag list can be mapped against 17 different databases, including data from SAGE Genie, SAGEmap, RefSeq mRNAs, MGC mRNAs, dbEST ESTs and a set of non-redundant virtual tags (see: ).
Submitting the tag list and selecting the databases is simple and fairly quick. The mapping process and the assembly of results, both done by ACTG, are also quick. A link to the results is sent to the user by e-mail, and the results are stored on the ACTG web site for 24 hours.
ACTG tag mapping: a step by step explanation.
Steps to map a tag list using ACTG:
1 - Starting the ACTG mapping. Click on 'Submit your tag list and run ACTG' on the ACTG main page (close to the Query in the left side menu).

2 - Submitting a tag list. There are two options: the user can submit a file directly or copy and paste the tag list. Only plain text data is allowed.

3 - Typing an e-mail address. This e-mail address will receive a link to the ACTG mapping results.

4 - Selecting a data type. Select the organism (human or mouse) and the tag type (short SAGE, long SAGE, short MPSS, long MPSS or SBS) that match the organism and tag type of the submitted tag list.

5 - Selecting the databases. The user should select at least one database for mapping purposes. Note that data from SAGE Genie and SAGEmap are available only for SAGE.

6 - Running ACTG. The user has to click on “Run ACTG” to initiate the processing. A link to the results will be available through a message sent to the email address provided. The mapping process is quick with most of the processes taking less than 10 minutes.

7 - Getting the results. Open the ACTG e-mail, click on the link, download the results and unzip the files.

8 - Opening the results. The output files are in plain text format (comma separated), a universal format that is easy to open in any text editor (for example, Notepad, MS-Word, OpenOffice or GNU Emacs), in any electronic spreadsheet (for example, MS-Excel and OpenOffice) or in any web browser (for example, MS-Internet Explorer or Firefox).

9 - Exploring your mapping result. There is a file for each database selected (for example, Hs_MGC_short_tag_mrnas_PolyATail_PolyASignal_result.txt), a file containing the merging of all databases (file name: all_mapping_result_merged.txt) and a file with the statistics of the mapping result (file name: StatisticsOfMapping.txt; this file is not zipped). For a numeric summary of the mapping, see the statistics file, which reports the number of tags mapped in each database, the number of tags mapped in the merged database and the number of putative artifactual tags (see What is the putative artifactual tags list?). All submitted tags are displayed in the result files. The non-mapped tags are located at the end of the file.
Some technical information about ACTG.
ACTG is a collection of Perl scripts, Perl + CGI scripts, shell scripts and HTML code. An important step in the ACTG assembly is the construction of the virtual tag databases. For each ACTG dataset, a set of Perl scripts and shell commands downloads the raw data, parses the cross-referenced information and produces the final assembly of the databases in the ACTG format.
Another important point is the processing of information submitted to ACTG. A set of Perl + CGI scripts processes all information submitted by the user, performs several checks (for example, whether an e-mail address was typed and whether a database was selected) and passes this information to the set of mapping programs.
The ACTG core is the set of Perl scripts responsible for the mapping of tags. These programs cross-reference the user tag list (real tags) with the virtual tag databases, accepting only perfect matches between a virtual tag and a real tag. The final results are the tag to gene assignments (one file for each database selected), a file containing the statistics of the mapping and a file merging all mapped databases.
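The perfect-match step described above can be illustrated with a minimal Python sketch. This is not the ACTG Perl code: the function names and the (tag, UniGene cluster, cDNA ID) record layout are assumptions made only for the example.

```python
from collections import defaultdict

def build_index(virtual_records):
    """Index virtual tag records by tag sequence.

    Each record is a (tag, unigene_cluster, cdna_id) tuple. This layout is
    an assumption for the example, not the ACTG database format.
    """
    index = defaultdict(list)
    for tag, cluster, cdna_id in virtual_records:
        index[tag.upper()].append((cluster, cdna_id))
    return index

def map_tags(user_tags, index):
    """Return (mapped, unmapped); only perfect tag matches are accepted."""
    mapped, unmapped = {}, []
    for tag in user_tags:
        hits = index.get(tag.upper())
        if hits:
            mapped[tag] = hits
        else:
            unmapped.append(tag)
    return mapped, unmapped

# Toy virtual tag database built from identifiers quoted on this page.
virtual = [
    ("GAAGTGTGTC", "Hs.5298", "BC001594.2"),
    ("ATCGGGCCCG", "Hs.584909", "NM_016558"),
    ("ATCGGGCCCG", "Hs.584909", "NM_033630"),
]
mapped, unmapped = map_tags(
    ["GAAGTGTGTC", "ATCGGGCCCG", "TTTTTTTTTT"], build_index(virtual)
)
```

A dictionary lookup is used here because exact matching needs no alignment; tags without a hit are collected separately, mirroring the non-mapped tags listed at the end of the result files.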
Rules for selecting the best database for mapping your tag list and rules for ranking the tag to gene mapping.
Apply these rules to obtain a reliable tag to gene mapping.
1. Choose tags mapped to mRNA sequences containing poly(A) tail and poly(A) signal or only poly(A) tail, or tags mapped to data from SAGE Genie or SAGEmap.
2. Choose tags mapped to EST sequences containing poly(A) tail and poly(A) signal or only poly(A) tail.
3. Choose tags mapped to sequences (from mRNAs database) containing poly(A) signal.
4. Choose tags mapped to mRNAs containing neither poly(A) tail nor poly(A) signal.
If there are multiple maps of the same tag, choose the best mapping following the four rules presented previously.
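As an illustration of how the four rules could be applied to a tag with multiple maps, here is a small Python sketch. The evidence labels (for example "mrna_tail_signal") are hypothetical names invented for this example; they are not identifiers used by ACTG.

```python
# Rule number for each kind of evidence; 1 is the most reliable.
# The labels are hypothetical names for the categories in the four rules.
RULE_RANK = {
    "mrna_tail_signal": 1, "mrna_tail": 1, "sage_genie": 1, "sagemap": 1,
    "est_tail_signal": 2, "est_tail": 2,
    "mrna_signal": 3,
    "mrna_none": 4,
}

def best_mappings(hits):
    """Keep the (cluster, evidence) hits that satisfy the lowest-numbered rule."""
    best = min(RULE_RANK[evidence] for _, evidence in hits)
    return [hit for hit in hits if RULE_RANK[hit[1]] == best]

hits = [
    ("Hs.515046", "est_tail"),
    ("Hs.2", "mrna_tail_signal"),
    ("Hs.2", "mrna_none"),
]
chosen = best_mappings(hits)
```

If several hits satisfy the same rule they are all returned, so any remaining ambiguity stays visible to the user.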
How are the databases in ACTG generated?
ACTG assembles data from three main datasets: SAGE Genie, SAGEmap, and virtual tags from almost all public cDNA sequences from GenBank. The construction of each virtual tag database used in ACTG is described below.
-virtual tags from SAGE Genie: first, files containing the virtual tags matched to UniGene clusters were downloaded from SAGE Genie (http://cgap.nci.nih.gov/SAGE/) (files: Hs_short.best_gene for human short SAGE; Hs_long.best_gene for human long SAGE; Mm_short.best_gene for mouse short SAGE; Mm_long.best_gene for mouse long SAGE). Second, each file was parsed and the raw data was converted into the ACTG format. Among all datasets available from SAGE Genie, only the best mapping for each tag was used, because this data contains a reliable tag to gene mapping and no redundancy. SAGE Genie chooses the best mapping for a tag based on the classification of the tag in a previously ranked virtual tag database. The ranking of databases is based on the percentage of virtual tags that are represented in a confident SAGE tag list (tags that are reliably observed in the transcriptome). For a complete description of the SAGE Genie protocol, refer to Boon et al. (2002) and the SAGE Genie website.
-virtual tags from SAGEmap: first, files containing reliable virtual tags matched to UniGene clusters were downloaded from SAGEmap (http://www.ncbi.nlm.nih.gov/projects/SAGE/) (file: SAGEmap_tag_ug-rel.zip for human and mouse). Second, the raw data was processed to select only the best tag assignment, as in the SAGE Genie data. This selection is based on the SAGEmap tag to gene score, a number that increases with the reliability of the tag to UniGene cluster match. The score is computed from the tag sequence complexity and the number of cDNA sequences assigned to the tag. For a complete description of the SAGEmap protocol, refer to Lash et al. (2000) and the SAGEmap web site.
-virtual tags from cDNA sequences: first, all cDNA data from UniGene (Hs.data and Hs.seq.all for human; Mm.data and Mm.seq.all for mouse), RefSeq (rna.fa for human and mouse.rna.fna for mouse), MGC (hs_mgc_mrna.fasta for human and mm_mgc_mrna.fasta for mouse) and dbEST (est_human for human and est_mouse for mouse) were downloaded. For RefSeq and MGC (both containing only mRNA sequences), sequences containing a poly(A) tail (at least 5 adenosines at the cDNA 3' end), a poly(A) signal (AAUAAA or AUUAAA in the 3'-most 50 bp segment) or both were selected. For the UniGene mRNA sequences (UniGene contains mRNAs, HTCs and ESTs), those containing a poly(A) tail, a poly(A) signal or both were selected. For dbEST, sequences with a poly(A) tail and sequences with both a poly(A) tail and a poly(A) signal were selected. Next, for all these cDNA datasets, the 3'-most virtual tag (of SAGE, MPSS and SBS type) was selected. Finally, to form each ACTG virtual tag database, the virtual tag was linked to the other information available for each cDNA (for example gene name and UniGene cluster).
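The sequence criteria in the last item can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are invented, sequences are written as DNA (so AAUAAA/AUUAAA become AATAAA/ATTAAA), the tag is taken to start immediately 3' of the enzyme site, and a site too close to the 3' end to yield a full tag simply returns no tag. How ACTG itself handles such edge cases is not described here.

```python
def has_polya_tail(seq, min_a=5):
    """True if the sequence ends with at least `min_a` adenosines."""
    return seq.upper().endswith("A" * min_a)

def has_polya_signal(seq, window=50):
    """True if AATAAA or ATTAAA occurs in the 3'-most `window` bases."""
    tail = seq.upper()[-window:]
    return "AATAAA" in tail or "ATTAAA" in tail

def three_prime_most_tag(seq, site="CATG", tag_len=10):
    """Return the `tag_len` bases following the 3'-most `site`, or None."""
    seq = seq.upper()
    pos = seq.rfind(site)
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None

# Synthetic cDNA made up for the example: two NlaIII sites, a poly(A)
# signal and a poly(A) tail.
cdna = ("GGG" + "CATG" + "AAACCCGGGT" + "TT" + "CATG" + "GAAGTGTGTC"
        + "CC" + "AATAAA" + "GTCTG" + "AAAAAA")
tag = three_prime_most_tag(cdna)                      # short SAGE
sbs_tag = three_prime_most_tag(cdna, "GATC", 16)      # no DpnII site here
```

For MPSS or SBS tags the same function would be called with site="GATC" and the corresponding tag length (13 or 16).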
The main datasets are assembled to contain essentially the same pattern of information: tag sequence, tag frequency (if the user submitted this information), UniGene cluster, gene name, a brief gene annotation, genomic locus, cDNA sequence ID and the database name. For example:
GAAGTGTGTC, 3, Hs.5298, ADIPOR1, Adiponectin receptor 1, 1p36.13-q41, BC001594.2, MGC, MGC_tail+signal
We have chosen this pattern because it contains sufficient information to identify the assigned gene and is easy to integrate with other applications.
An important observation is that SAGE Genie and SAGEmap provide virtual tags only for short and long SAGE, not for MPSS and SBS technologies.
What are the influences of redundant cDNAs in the virtual tag databases?
As ACTG uses all public cDNAs from human and mouse (the most represented organisms in the cDNA databases), there are many redundant sequences. Why has ACTG kept the redundancy? First, users have access to the complete public dataset. Second, a high redundancy suggests, at least, a highly studied gene. Third, mainly for tags mapped only by ESTs, a high redundancy indicates a high quality tag (without genomic contamination or internal priming). Finally, users can always exclude this redundancy by considering only the UniGene cluster, not the cDNA sequence ID.
In addition to the redundancy within each cDNA database, some ACTG datasets share sequences (for example, mRNAs from the UniGene and MGC databases). Why does ACTG also introduce this redundancy? It is a consequence of ACTG allowing users to map their data using the most commonly used public cDNA databases. Below is a brief description of the databases and their relevance.
UniGene is the most used clustering method (UniGene contains many sequences from RefSeq, MGC and dbEST). RefSeq is the most used collection of full length cDNAs (many RefSeq and MGC sequences are identical). MGC is the most important full length sequencing project, and dbEST is the most complete repository of ESTs for both human and mouse.
However, users who do not want redundancy can map the tag list against the non-redundant ACTG virtual tags (for further detail, refer to "What is the non-redundant ACTG tag list? How does ACTG create this dataset?"). This database does not contain redundancy at the cDNA level.
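For users who prefer to remove the cDNA-level redundancy themselves, by considering only the UniGene cluster as suggested above, a minimal Python sketch could look like this. The dictionary keys ("tag", "unigene", "cdna") are assumed names for already parsed result rows, not ACTG column headers.

```python
def collapse_by_cluster(rows):
    """Keep one row per (tag, UniGene cluster) pair, in first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["tag"], row["unigene"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"tag": "ATCGGGCCCG", "unigene": "Hs.584909", "cdna": "NM_016558"},
    {"tag": "ATCGGGCCCG", "unigene": "Hs.584909", "cdna": "NM_033630"},
    {"tag": "GAAGTGTGTC", "unigene": "Hs.5298", "cdna": "BC001594.2"},
]
unique = collapse_by_cluster(rows)
```

Only the first cDNA seen for each tag and cluster is kept; a tag that maps to two different clusters still yields two rows, so genuine ambiguity is not hidden.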
Output files.
For each tag list mapping, ACTG generates the following files.
Result of mapping: These files contain the output of the mapping procedures. There is one file for each database selected for the mapping.
Merged data: This file contains a merging of the databases chosen by the user. The user can see all mapped tags in the same file (the last column indicates the database name). The file is ordered by tag sequence.
Statistical file: This file contains numerical information about the mapping process: the number of mapped tags, the number of unmapped tags and the number of putative artifactual tags.
ACTG output data nomenclature:
All files containing the results of mapping are named according to the following convention:
myTAGs_Hs_MGCpolyAtail_shortSAGE.csv
Codes: "Hs" = Homo sapiens; "Mm" = Mus musculus; "polyAtailSignal" = sequences containing poly(A) tail and poly(A) signal; "polyAtail" = sequences containing poly(A) tail; "polyAsignal" = sequences containing poly(A) signal; "none" = sequences containing neither poly(A) signal nor poly(A) tail.
FAQ
What is a virtual tag?
A virtual tag is a prediction of the tag (10 bp for short SAGE, 17 bp for long SAGE, 13 bp for short MPSS, 16 bp for long MPSS and SBS) that would be observed in a SAGE, MPSS or SBS experiment, derived from a transcript sequence that contains a poly(A) tail and an NlaIII site (CATG) for SAGE or a DpnII site (GATC) for MPSS and SBS.
What should I do if my favorite tag is mapped to different cDNAs or different UniGene clusters?
If ACTG mapped your favorite tag to two different cDNAs (for example, the SAGE tag "ATCGGGCCCG" is assigned to NM_016558 and NM_033630), but these cDNAs are in the same UniGene cluster (both NM_016558 and NM_033630 are in the cluster Hs.584909), you do not have a problem. This 'redundancy' is acceptable and does not influence the tag to gene assignment. It will be frequent if you mapped a tag list against different databases, for example MGC and RefSeq (refer to the section What are the influences of redundant cDNAs in the virtual tag databases?). However, if your favorite tag maps to two different UniGene clusters, you may have a problem, because this sequence represents an ambiguous tag to gene assignment. If ACTG added an "*" to the tag sequence, there is strong evidence of ambiguity (refer to "What is the putative artifactual tags list?") and this 'redundancy' cannot be removed. If ACTG did not add an "*" to the tag sequence, there is no strong evidence of ambiguity, and this 'redundancy' can be removed either by mapping the tag list against the non-redundant tag list (see "What is the non-redundant ACTG tag list? How does ACTG create this dataset?") or by applying some rules (see Rules for selecting the best database for mapping your tag list and rules for ranking the tag to gene mapping).
What is the non-redundant ACTG tag list? How does ACTG create this dataset?
The non-redundant ACTG tag list is a set of non-redundant virtual tags drawn from all ACTG virtual tag databases. It is built by merging all ACTG datasets and removing their redundancy at two levels. First, we removed the redundancy at the cDNA level, keeping only one copy of a virtual tag derived from distinct cDNAs present in the same UniGene cluster. Second, we removed the redundancy at the UniGene cluster level, by analyzing all tags mapped to two or more UniGene clusters and selecting only the match with the highest reliability. The reliability of a match is given by the origin of the virtual tag: a tag to UniGene cluster match derived from the RefSeq database is more reliable than one derived from the dbEST database. For example, if a virtual tag is mapped to the UniGene cluster Hs.2 using data from RefSeq sequences with poly(A) tail and poly(A) signal, but the same tag is mapped to the cluster Hs.515046 using data from dbEST, we include in the non-redundant tag list only the match of the tag to the cluster Hs.2. The order of dataset reliability (from the highest to the lowest) is: 1) virtual tags from all mRNAs with poly(A) tail and poly(A) signal, 2) virtual tags from mRNAs with poly(A) tail, 3) virtual tags from SAGE Genie, 4) virtual tags from SAGEmap, 5) virtual tags from ESTs with a poly(A) tail and poly(A) signal, 6) virtual tags from ESTs with a poly(A) tail, and 7) virtual tags from mRNAs with only a poly(A) signal.
If a virtual tag from the same dataset (for example RefSeq with poly(A) tail) matched different UniGene clusters, we selected the cluster containing the greatest number of cDNAs reporting that match. However, if a virtual tag from the dataset 'mRNAs with poly(A) tail and poly(A) signal' (the most reliable database) matched two or more UniGene clusters, we kept this 'redundancy' and marked the tag as an 'ambiguous tag'. We did not remove this redundancy because an ambiguous tag may represent the expression of several genes (a false-positive expression), and it is fundamental for users to have this information.
In addition, all virtual tags in the non-redundant tag list are ranked in three categories of tag to gene assignment: i) high reliability, ii) medium reliability, and iii) low reliability. This classification depends on the datasets from which the virtual tag originated. All putative artifactual tags (tags similar to the SAGE linker sequence and ambiguous tags) were classified as low reliability. All virtual tags from mRNAs with only a poly(A) signal, and virtual tags from SAGE Genie/SAGEmap that are not present in the high reliability set, were classified as medium reliability. All virtual tags from mRNAs with a poly(A) tail were classified as high reliability (this set includes several virtual tags also present in the SAGE Genie and SAGEmap datasets). We believe that this non-redundant tag list should simplify the tag mapping process and the user's interpretation of the final results.
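The cluster-level selection described in this answer can be sketched in Python as below. It is a simplified illustration, not the ACTG implementation. Several points are assumptions: the dataset labels are invented, item 2 of the reliability order is read as "mRNAs with poly(A) tail only" (item 7 already covers poly(A) signal only), ties in cDNA support are broken alphabetically, and the tags and cDNA IDs in the example are made-up placeholders.

```python
from collections import Counter

# Dataset order, 1 = most reliable. Labels are hypothetical; rank 2 is
# assumed to mean mRNAs with a poly(A) tail only.
DATASET_RANK = {
    "mrna_tail_signal": 1, "mrna_tail": 2, "sage_genie": 3, "sagemap": 4,
    "est_tail_signal": 5, "est_tail": 6, "mrna_signal": 7,
}

def non_redundant(records):
    """records: (tag, cluster, cdna_id, dataset) tuples.

    Returns {tag: (clusters, is_ambiguous)}.
    """
    by_tag = {}
    for tag, cluster, cdna_id, dataset in records:
        by_tag.setdefault(tag, []).append((cluster, cdna_id, dataset))

    result = {}
    for tag, hits in by_tag.items():
        best = min(DATASET_RANK[d] for _, _, d in hits)
        # Count distinct cDNAs per cluster within the best-ranked dataset.
        pairs = {(c, i) for c, i, d in hits if DATASET_RANK[d] == best}
        support = Counter(cluster for cluster, _ in pairs)
        if best == 1 and len(support) > 1:
            # Most reliable dataset disagrees: keep all, mark ambiguous.
            result[tag] = (sorted(support), True)
        else:
            top = max(support.values())
            winners = sorted(c for c, n in support.items() if n == top)
            result[tag] = ([winners[0]], False)  # alphabetical tie-break
    return result

records = [
    ("TAG_A", "Hs.2", "cdna1", "mrna_tail_signal"),
    ("TAG_A", "Hs.515046", "est1", "est_tail"),
    ("TAG_B", "Hs.10", "cdna2", "mrna_tail"),
    ("TAG_B", "Hs.10", "cdna3", "mrna_tail"),
    ("TAG_B", "Hs.11", "cdna4", "mrna_tail"),
    ("TAG_C", "Hs.20", "cdna5", "mrna_tail_signal"),
    ("TAG_C", "Hs.21", "cdna6", "mrna_tail_signal"),
]
selected = non_redundant(records)
```

In this toy input, TAG_A keeps only the cluster supported by the more reliable dataset, TAG_B keeps the cluster with more supporting cDNAs, and TAG_C is kept with both clusters and flagged as ambiguous.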
What is the difference between SAGE and MPSS?
Both SAGE and MPSS produce a similar output: a list of short sequences (tags) and a frequency for each tag. However, the method of obtaining the tag list is dramatically different. SAGE uses concatenated tags that are sequenced with a traditional automated DNA sequencing method. The most common SAGE tag lengths are ~10 and 17 bp, and a good library may contain more than 50,000 tags (average of 100,000 tags). In contrast, MPSS uses a novel cloning and sequencing method in which hundreds of thousands of sequences are obtained simultaneously by sequencing off beads, using a technique of enzymatic digestion and hybridization. The most common MPSS tag lengths are 13 and 16 bp, and a library may contain more than 1,200,000 tags. You can find more details about SAGE and MPSS in Velculescu et al. (1995) and Brenner et al. (2000), respectively.
What exactly is SBS?
Sequencing-By-Synthesis (SBS) is a powerful new technology for the quantitative measurement of gene expression. Individual mRNAs are identified through a 16 bp sequence immediately adjacent to the 3' end of the 3'-most DpnII restriction site in cDNA sequences. SBS uses four proprietary fluorescently labeled modified nucleotides to sequence millions of cDNA fragments in parallel, a single base at a time (Solexa, http://www.solexa.com/wt/page/index). Like MPSS, an SBS library contains more than 1 million tags.
What is the difference between MPSS and SBS?
SBS has an economical advantage over MPSS. SBS technology generates up to one billion bases of data per run at a lower cost than MPSS (as of September 2006, MPSS technology is being replaced by SBS; for details, see the Solexa company).
What data can ACTG map?
ACTG can map short SAGE (10 bp), long SAGE (17 bp), short MPSS (13 bp), long MPSS (16 bp) and SBS (16 bp) tags for both human and mouse genomes.
What should be the input data format?
The input information (tag list) can be space, tab or comma delimited.
No tag was mapped, what is the problem?
If your tag list was not mapped, please make sure that:
- the tag type of the submitted tag list matches the tag type selected on the ACTG web page (for example, if you submitted a long SAGE tag list, you should select 'Long SAGE' in the 'Tag type' box).
- you removed the enzyme site ("CATG" for SAGE and "GATC" for MPSS and SBS technologies) from each tag sequence (for example, short SAGE tags should be 10 bp long and SBS tags should be 16 bp long).
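Before submitting, a tag list can be checked locally against the two points above. The following Python sketch is only a suggestion and is not part of ACTG: it assumes the tag is the first field on each line (an optional frequency may follow), and it uses the NlaIII site CATG for SAGE and the DpnII site GATC for MPSS and SBS, as given in the virtual tag definition.

```python
import re

TAG_LENGTHS = {"short SAGE": 10, "long SAGE": 17,
               "short MPSS": 13, "long MPSS": 16, "SBS": 16}
ENZYME_SITE = {"short SAGE": "CATG", "long SAGE": "CATG",
               "short MPSS": "GATC", "long MPSS": "GATC", "SBS": "GATC"}

def check_tag_list(text, tag_type):
    """Return (line number, tag, problem) for tags that do not fit tag_type."""
    length = TAG_LENGTHS[tag_type]
    site = ENZYME_SITE[tag_type]
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Accept space, tab or comma delimited lines.
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue
        tag = fields[0].upper()  # assumption: the tag is the first field
        if len(tag) == length + len(site) and tag.startswith(site):
            problems.append((line_no, tag, "enzyme site not removed"))
        elif len(tag) != length:
            problems.append((line_no, tag, f"expected {length} bp, got {len(tag)}"))
        elif re.fullmatch(r"[ACGT]+", tag) is None:
            problems.append((line_no, tag, "non-ACGT character"))
    return problems

text = "GAAGTGTGTC,3\nCATGGAAGTGTGTC 5\nGAAGTGTG\t1\n"
problems = check_tag_list(text, "short SAGE")
```

An empty result means every tag has the length expected for the selected tag type and contains only A, C, G and T.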
Why does ACTG generate multiple output files? What does each output file contain?
The final result contains one file for each mapped database and two additional files. For example, if the user chooses to map the tag list using the tags from SAGE Genie and the tags from UniGene mRNAs, the final result will contain one file for tags mapped to SAGE Genie tags and another for tags mapped to UniGene mRNA tags. The first additional file contains all data merged and the second additional file provides simple statistics about the mapping process.
Is the output file comma or tab delimited?
The output files are comma delimited.
Are all output files in the same format?
Yes, all files containing the results of tag mapping are in the same format and contain the following information: tag sequence, tag frequency (if the user submitted this information), UniGene cluster, gene name, a brief gene annotation, genomic locus, cDNA sequence ID and the database name. For example:
GAAGTGTGTC, 3, Hs.5298, ADIPOR1, Adiponectin receptor 1, 1p36.13-q41, BC001594.2, MGC, MGC_tail+signal
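A result row in this format can be read with Python's standard csv module, as in the sketch below. The field names are assumptions chosen to follow the description above. The example row has nine values while eight fields are described, so the ninth is given the hypothetical name "subset". The sketch also assumes the frequency column is present and that annotations contain no commas.

```python
import csv
import io

# Assumed field names; "subset" is a hypothetical name for the ninth value.
FIELDS = ["tag", "frequency", "unigene", "gene", "annotation",
          "locus", "cdna_id", "database", "subset"]

def read_results(handle):
    """Yield one dict per non-empty row of a comma-delimited result file."""
    reader = csv.reader(handle, skipinitialspace=True)
    for row in reader:
        if row:
            yield dict(zip(FIELDS, (value.strip() for value in row)))

sample = ("GAAGTGTGTC, 3, Hs.5298, ADIPOR1, Adiponectin receptor 1, "
          "1p36.13-q41, BC001594.2, MGC, MGC_tail+signal\n")
rows = list(read_results(io.StringIO(sample)))
```

For a real result file, the io.StringIO object would be replaced by an open file handle.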
What do the symbols '*', '%' or '#' mean?
'*' identifies tags mapped to multiple genes (the ambiguous tags).
'#' identifies tags similar to the linker used in the SAGE library construction (including 1 bp variation).
What is the putative artifactual tags list?
The putative artifactual tag list is a set of tags that do not have a reliable tag to gene assignment. The ACTG artifactual tag list has two sets of data.
The first set contains tags that are similar (allowing 1 mismatch) to the sequence of the linker used in the construction of SAGE libraries. If linker sequences contaminate the SAGE library during construction, the software for SAGE data extraction (for example SAGE2000, www.sagenet.org) will remove these artifactual tags. However, if a sequencing error also occurs and these artifactual tags do not have the usual sequence, they will not be removed. ACTG solves this problem by also filtering sequences containing 1 mismatch to the linker sequence (allowing this mismatch has not filtered out real tags). The symbol '#' identifies these tags.
The second set contains tags that map to two or more different UniGene clusters, here called ambiguous tags. These tags are selected by the analysis and comparison of virtual tags from all mRNAs with poly(A) tail and with poly(A) tail and poly(A) signal (the most reliable tag to cDNA mapping). It is important to identify an ambiguous tag, because its expression can report multiple genes. The symbol '*' identifies these tags.
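The one-mismatch linker filter can be illustrated with a Hamming distance check, as in this Python sketch. The two linker-derived tags used here are example values often reported for short SAGE libraries and are an assumption; they should be replaced by the sequences from your own library protocol. How ACTG attaches the '#' symbol to a tag is not specified here, so the sketch returns the flag separately.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def is_linker_like(tag, linker_tags, max_mismatch=1):
    """True if the tag is within max_mismatch substitutions of any linker tag."""
    tag = tag.upper()
    return any(
        len(tag) == len(linker) and hamming(tag, linker) <= max_mismatch
        for linker in linker_tags
    )

# Example linker-derived tags (assumed values, check your protocol).
LINKER_TAGS = ("TCCCCGTACA", "TCCCTATTAA")

tags = ["TCCCCGTACA", "TCCCCGTACT", "GAAGTGTGTC"]
flags = [(tag, "#" if is_linker_like(tag, LINKER_TAGS) else "") for tag in tags]
```

Only substitutions are considered; insertions and deletions would change the tag length and are not covered by this check.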
Is ACTG freely available?
Yes, ACTG is freely available. For commercial use, please send us an e-mail.
What do I do if I want to publish a work using ACTG?
Please cite the ACTG paper.
How often will the ACTG dataset be updated?
An automated update will be conducted three times a year. The versions of ACTG and its datasets are in the 'Data Release' information web page.